[comp.lang.c] Efficient STRing CoMPares?

stone@mars.mpr.ca (Darren Stone) (03/16/91)

Does anyone know if there's a better (!?) way to compare
strings than with strcmp()?

I'm looking for any speed advantage I can get!

Thanx in advance...

- Darren -

wirzenius@cc.helsinki.fi (Lars Wirzenius) (03/17/91)

In article <1991Mar15.142821@mars.mpr.ca>, stone@mars.mpr.ca (Darren Stone) writes:
> Does anyone know if there's a better (!?) way to compare
> strings than with strcmp()?
> I'm looking for any speed advantage I can get!

For non-portability, the fastest way would probably be to write inline
assembly code using the machine's string compare instruction.  Of
course, this instruction doesn't exist in all machines, but if it does,
it's probably quite fast. 

Another way (hopefully somewhat more portable) to go is to be clever,
and to avoid string comparisons by modifying the algorithm so they
aren't needed (or are needed less frequently).  

After that you may be able to improve on the comparison method. One way
is to avoid calling strcmp if the first characters differ:

	#define STRCMP(s,t) (*(s) == *(t) ? strcmp((s), (t)) : \
			(unsigned char)*(s) - (unsigned char)*(t))

(or something along those lines, this is untested). If your strings are
mostly going to differ in the first few bytes, you could pre-compute a
special value based on those bytes, and compare it first:

	#define BITS_IN_CHAR 8
	
	struct string {
		char *str;
		long value;
	};
	
	void compute_value(struct string *s) {
		int i;
	
		s->value = 0;
		for (i = 0; i < sizeof(long) && s->str[i] != '\0'; ++i) {
			s->value <<= BITS_IN_CHAR;
			s->value |= (unsigned char) s->str[i];	/* avoid sign extension */
		}
		for ( ; i < sizeof(long); ++i)	/* pad short strings; shifting one */
			s->value <<= BITS_IN_CHAR;	/* byte at a time avoids a full-width shift */
	}

(The above could be optimized, but you get the idea.) 

If the pre-computed values differ, you know the strings differ in the
first sizeof(long) bytes, so there's no need to call strcmp. 
Furthermore, if you are careful to arrange the first character as the
most significant byte of the value (and so on), you can even find out
which string (according to the byte values, no locale collation sequence
here) is greater.  Of course, if you need the locale collation sequence,
you have to use strcmp.
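The comparison itself might then look something like this (an untested
sketch; the names vstring, vs_value and vs_cmp are invented, and an
unsigned long is used for the cached value so that sign extension can't
upset the byte ordering):

```c
#include <string.h>

#define BITS_IN_CHAR 8

/* Untested sketch of the scheme described above. */
struct vstring {
	char *str;
	unsigned long value;
};

void vs_value(struct vstring *s)	/* like compute_value() above */
{
	size_t i;

	s->value = 0;
	for (i = 0; i < sizeof(unsigned long) && s->str[i] != '\0'; ++i)
		s->value = (s->value << BITS_IN_CHAR) | (unsigned char) s->str[i];
	for ( ; i < sizeof(unsigned long); ++i)
		s->value <<= BITS_IN_CHAR;	/* pad short strings with zero bytes */
}

int vs_cmp(const struct vstring *a, const struct vstring *b)
{
	if (a->value != b->value)	/* decided within the first few bytes */
		return (a->value < b->value) ? -1 : 1;
	return strcmp(a->str, b->str);	/* prefixes agree; do it the hard way */
}
```

Note that vs_cmp returns only the sign of the comparison, not a byte
difference, which is all strcmp promises anyway.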
-- 
Lars Wirzenius    wirzenius@cc.helsinki.fi

gwyn@smoke.brl.mil (Doug Gwyn) (03/17/91)

In article <1991Mar15.142821@mars.mpr.ca> stone@mars.mpr.ca (Darren Stone) writes:
>Does anyone know if there's a better (!?) way to compare
>strings than with strcmp()?
>I'm looking for any speed advantage I can get!

#define	StrEq( a, b )	(*(a) == *(b) && strcmp( a, b ) == 0)	/* UNSAFE */

dave@cs.arizona.edu (Dave Schaumann) (03/17/91)

In article <15486@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:
|In article <1991Mar15.142821@mars.mpr.ca> stone@mars.mpr.ca (Darren Stone) writes:
|>Does anyone know if there's a better (!?) way to compare
|>strings than with strcmp()?
|>I'm looking for any speed advantage I can get!
|
|#define  StrEq( a, b )	(*(a) == *(b) && strcmp( a, b ) == 0)	/* UNSAFE */

By "unsafe", I assume you mean it will likely crash on a NULL pointer.
I wouldn't be a bit surprised if strcmp() chokes on a NULL pointer, too.

But the main point I want to make is that this really buys you little unless
you are working under 2 conditions:

	- strcmp() is not inlined
	- function calls are expensive

I think you'll have to agree that any professional C compiler worth the
name would almost certainly inline *at least* the str???() functions in
an environment where function calls are expensive.

Also, if your strings tend to compare equal a lot of the time, this
will actually be slower, since the extra first-character test never
pays for itself.

-- 
Dave Schaumann | dave@cs.arizona.edu | Short .sig's rule!

gvr@cs.brown.edu (George V. Reilly) (03/17/91)

In article <1991Mar15.142821@mars.mpr.ca> stone@mars.mpr.ca (Darren Stone) writes:
% 
% Does anyone know if there's a better (!?) way to compare
% strings than with strcmp()?
% 
% I'm looking for any speed advantage I can get!

If you're comparing strings for equality, you could use the
following macro:

#define STREQ(a,b) (*(a) == *(b) && (*(a) == '\0' || strcmp((a)+1,(b)+1) == 0))

which only calls strcmp() if the two strings agree in the first
character and if they're not zero-length strings.  Thus, for most
string comparisons, strcmp will not be called.  Unless the
implementors of your libraries really botched it, strcmp()
should already be extremely efficient.

You should be able to work out STRLT, STRLE, STRNE, STRGE, and STRGT
(if you need them) based on the above example.
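For instance (untested, in the same spirit as STREQ; like STREQ, each
argument may be evaluated more than once, and the casts keep signed
chars from misordering):

```c
#include <string.h>

/* Untested sketch: a three-way compare that short-circuits on the
 * first character, plus the relational wrappers built on it. */
#define STRCMP3(a,b) (*(a) == *(b) ? strcmp((a), (b)) : \
		((unsigned char) *(a) < (unsigned char) *(b) ? -1 : 1))
#define STRLT(a,b) (STRCMP3(a,b) < 0)
#define STRLE(a,b) (STRCMP3(a,b) <= 0)
#define STRNE(a,b) (STRCMP3(a,b) != 0)
#define STRGE(a,b) (STRCMP3(a,b) >= 0)
#define STRGT(a,b) (STRCMP3(a,b) > 0)
```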
________________
George V. Reilly   `Toad of Toad Hall'	gvr@cs.brown.edu   +1 (401) 863-7684
uunet!brunix!gvr   gvr@browncs.bitnet	Box 1910, Brown U, Prov, RI 02912

gwyn@smoke.brl.mil (Doug Gwyn) (03/17/91)

In article <1193@caslon.cs.arizona.edu> dave@cs.arizona.edu (Dave Schaumann) writes:
>|#define  StrEq( a, b )	(*(a) == *(b) && strcmp( a, b ) == 0)	/* UNSAFE */
>By "unsafe", I assume you mean it will likely crash on a NULL pointer.

No, I mean that it is an "unsafe function-like macro", meaning that
its arguments will be evaluated other than once each, unlike the
situation for a genuine function invocation.  This is something
that the user of the macro needs to be aware of.

I take it for granted that a null pointer should not be provided as
an argument to a function expecting an object pointer, unless it
is specifically advertised as doing something useful in that case.

>Also, if you tend to compare equal successfully a lot, this will actually
>be slower.

In most applications, comparing unequal would be far more frequent.
This optimization is a win in almost all situations, even when the
compiler generates in-line code for strcmp().

(Note, however, that one release of Gould's UTX-32 C compiler
generated bad code if an argument to StrEq() was a string literal;
the base registers got used up unnecessarily.  I think this was
fixed in a later release.)

If you happen to know that there is at least one character before
the terminator in each string, you can further optimize:

#define	StrEq1( a, b )	(*(a) == *(b) && strcmp( (a) + 1, (b) + 1 ) == 0)
					/* UNSAFE */

However, I prefer the first form, which works wherever strcmp() does,
so long as the multiple evaluation of its arguments is not an issue.

mark@urz.unibas.ch (03/18/91)

In article <1991Mar15.142821@mars.mpr.ca>, stone@mars.mpr.ca (Darren Stone) writes:
> 
> Does anyone know if there's a better (!?) way to compare
> strings than with strcmp()?
> 

	Have you already looked at what code an optimizing compiler
produces for strcmp()?  It may be very good inline code (no subroutine
call) if the compiler is clever.
						G.Mark
 

drh@duke.cs.duke.edu (D. Richard Hipp) (03/18/91)

In article <15486@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:
>In article <1991Mar15.142821@mars.mpr.ca> stone@mars.mpr.ca (Darren Stone) writes:
>>Does anyone know if there's a better (!?) way to compare
>>strings than with strcmp()?
>>I'm looking for any speed advantage I can get!
>
>#define	StrEq( a, b )	(*(a) == *(b) && strcmp( a, b ) == 0)	/* UNSAFE */

StrEq does not duplicate the semantics of strcmp.  Recall that strcmp
returns either negative, zero, or positive depending on whether "a" is
less than, equal to, or greater than "b".  StrEq, on the other hand,
returns only true or false depending on whether or not "a" is
equal to "b".  This will make a big difference if you are using StrEq
to sort, or to do a binary search.

consp06@bingsuns.cc.binghamton.edu (Robert Konigsberg) (03/19/91)

Somebody said part of the problem with string comparisons is that if
most of the strings are equivalent, a lot of processor time is spent.
Wouldn't it be good, then, to include in the macro something to compare
the actual POINTERS?  If the pointers are the same, then the two strings
have no CHOICE but to be equivalent.  This would really cut down the
time under certain situations.

			-Rob Konigsberg

gwyn@smoke.brl.mil (Doug Gwyn) (03/19/91)

In article <669305557@juliet.cs.duke.edu> drh@duke.cs.duke.edu (D. Richard Hipp) writes:
>StrEq does not duplicate the semantics of strcmp.

Nor was it meant to.

peter@ficc.ferranti.com (Peter da Silva) (03/19/91)

In article <1193@caslon.cs.arizona.edu> dave@cs.arizona.edu (Dave Schaumann) writes:
> I think you'll have to agree that any professional C compiler worth the name
> would almost certainly inline *at least* the str???() functions; given an
> environment where function calls are expensive.

By "professional" I assume you mean "optimised for running dhrystone", right?
-- 
Peter da Silva.  `-_-'  peter@ferranti.com
+1 713 274 5180.  'U`  "Have you hugged your wolf today?"

gwyn@smoke.brl.mil (Doug Gwyn) (03/19/91)

In article <1991Mar18.174207.7377@bingvaxu.cc.binghamton.edu> consp06@bingsuns.cc.binghamton.edu (Robert Konigsberg) writes:
>Wouldn't it be good then, to include in the macro, something to compare
>the actual POINTERS?  If the pointers are the same then the two strings
>have no CHOICE but to be equivalent.

Nice try, but in the vast majority of applications the strings being
compared, even in cases where they match, are contained in different
storage locations.  E.g.
	if ( StrEq( buffer, "EOF" ) ) ...

barton@cbnewsk.att.com (jim.m.barton) (03/19/91)

In article <15486@smoke.brl.mil>, gwyn@smoke.brl.mil (Doug Gwyn) writes:
> In article <1991Mar15.142821@mars.mpr.ca> stone@mars.mpr.ca (Darren Stone) writes:
> >Does anyone know if there's a better (!?) way to compare
> >strings than with strcmp()?
> >I'm looking for any speed advantage I can get!
> 
> #define	StrEq( a, b )	(*(a) == *(b) && strcmp( a, b ) == 0)	/* UNSAFE */

It is doubtful that there is one *general* way of optimizing strcmp() for
all machines.  Most strcmp() implementations are written in machine language
and have already gone through some serious optimization analysis.  Also,
since comparing NUL-terminated character strings is such a common operation,
many (non-RISC) machines today have some direct support for comparing strings
in their instruction sets.  This means that a strcmp() can be done in only
a few instructions on many machines, rather than as a character-by-character
comparison loop.

Although there is probably no *general* way of optimizing strcmp(),
you might be able to do optimizations specific to your application by
taking advantage of the type of data and expected frequency of data used
in your application.  The above macro, StrEq(), MIGHT be faster if most
of the strings being compared tend to differ in their first character.
When two strings differ in their initial character, StrEq() will save
having to do a strcmp() call.  Note, however, that when a pair of strings
do NOT differ in their first character, a strcmp() must be done along with
the added overhead (albeit small) of the extra character comparison
and the logical AND operation.

StrEq() might actually be SLOWER than a simple strcmp(), depending on
your application's data and your machine (the overhead of doing the
extra operations versus the execution time for strcmp()).  IMHO, I
doubt that StrEq() will achieve any significant optimization in MOST
cases; it may be slower, or faster by an amount too small to measure.

A more fruitful approach might be to examine your data more closely to
see if some of the strcmp()'s could be avoided.  For example, if many
of the strings being compared actually belong to a restricted set of
values rather than being arbitrary strings, you might be able to
represent them internally as enumerated values (where integer
comparisons could be done) rather than as strings.  Also, for strings
that have fixed or known lengths, you could do a memcmp() rather than a
strcmp(), although any speedup here would certainly be machine-dependent.
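The restricted-set idea might be sketched like this (illustrative only;
the keyword names are made up, and the point is that the string compare
is paid once per token rather than once per comparison):

```c
#include <string.h>

/* Illustrative only: map a known, restricted vocabulary to small
 * integers once, then compare the integers thereafter. */
enum kw { KW_UNKNOWN, KW_BEGIN, KW_END, KW_WHILE };

enum kw kw_lookup(const char *s)
{
	static const struct { const char *name; enum kw code; } tab[] = {
		{ "begin", KW_BEGIN },
		{ "end",   KW_END   },
		{ "while", KW_WHILE },
	};
	size_t i;

	for (i = 0; i < sizeof tab / sizeof tab[0]; i++)
		if (strcmp(s, tab[i].name) == 0)   /* one strcmp per token, not per use */
			return tab[i].code;
	return KW_UNKNOWN;
}
```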

Jim Barton

rsalz@bbn.com (Rich Salz) (03/19/91)

In <1991Mar18.174207.7377@bingvaxu.cc.binghamton.edu> consp06@bingsuns.cc.binghamton.edu (Robert Konigsberg) writes:
|  Wouldn't it be good then, to include in the macro,
|something to compare the actual POINTERS?  If the pointers are the
|same then the two strings have no CHOICE but to be equivalent.  This
|would really cut down the time under certain situations.
If you're careful about the circumstances under which you use it, you might
want to do something like this:
    #define EQ(p, q)                \
	    ((p) == (q) || ((p)[0] == (q)[0] && strcmp((p), (q)) == 0))
I did this in Coda ("File distribution alternative to rdist", comp.sources.unix
Volume 21) which builds lots of structures with shared strings.  I got a slight
performance gain (1 or 2%, I think).

It's the difference between (eq ... ) and (equal ...), right-p?
	/r$
-- 
Please send comp.sources.unix-related mail to rsalz@uunet.uu.net.
Use a domain-based address or give alternate paths, or you may lose out.

wirzenius@cc.helsinki.fi (Lars Wirzenius) (03/19/91)

In article <1991Mar18.174207.7377@bingvaxu.cc.binghamton.edu>, consp06@bingsuns.cc.binghamton.edu (Robert Konigsberg) writes:
> Somebody said part of the problem with string comparisons is that if
> most of the strings are equivalent, a lot of processor time is spent.
> Wouldn't it be good then, to include in the macro,
> something to compare the actual POINTERS?  If the pointers are the
> same then the two strings have no CHOICE but to be equivalent.  This
> would really cut down the time under certain situations.

No, two strings may be equivalent, even if they are stored in different
places (this applies to string constants too, if they appear several
times). For example:

	char *s = "hello";
	char *t = "hello";

s and t may or may not point to the same storage; you can't depend on
either behaviour.  Another example:

	char s[10] = "hello"; 
	char t[10];

	strcpy(t, s);

Now the contents (up to and including the terminating '\0', but no
longer of course) of s and t are the same, but the addresses of the two
variables are different.

Of course, if two pointers to char are equal, the strings will be the
same, but this isn't the only case in which strings match, nor do I
think it is all that common.

-- 
Lars Wirzenius    wirzenius@cc.helsinki.fi

steve@taumet.com (Stephen Clamage) (03/19/91)

peter@ficc.ferranti.com (Peter da Silva) writes:

>In article <1193@caslon.cs.arizona.edu> dave@cs.arizona.edu (Dave Schaumann) writes:
>> I think you'll have to agree that any professional C compiler worth the name
>> would almost certainly inline *at least* the str???() functions; given an
>> environment where function calls are expensive.

>By "professional" I assume you mean "optimised for running dhrystone", right?

I wouldn't assume so.  Some of the string functions, notably strcpy and
strlen, have shorter bodies than calling sequences.  A compiler writer
would choose to emit these functions inline because the code is both
smaller and faster.
-- 

Steve Clamage, TauMetric Corp, steve@taumet.com

torek@elf.ee.lbl.gov (Chris Torek) (03/20/91)

In article <15510@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:
>... in the vast majority of applications the strings being
>compared, even in cases where they match, are contained in different
>storage locations.  E.g.
>	if ( StrEq( buffer, "EOF" ) ) ...

Of course, you can build yourself a `string pool' system, in which each
distinct string appears exactly once, and then two strings match iff their
pointers match . . . but this merely offloads the `compare for truly
equal' into the string pool system.

If you need to do many compares, this is probably worthwhile anyway.
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

consp06@bingsuns.cc.binghamton.edu (Robert Konigsberg) (03/20/91)

In article <15510@smoke.brl.mil>, gwyn@smoke.brl.mil (Doug Gwyn) writes:
|> In article <1991Mar18.174207.7377@bingvaxu.cc.binghamton.edu>
consp06@bingsuns.cc.binghamton.edu (Robert Konigsberg) writes:
|> >Wouldn't it be good then, to include in the macro, something to compare
|> >the actual POINTERS?  If the pointers are the same then the two strings
|> >have no CHOICE but to be equivelant.
|> 
|> Nice try, but in the vast majority of applications the strings being
|> compared, even in cases where they match, are contained in different
|> storage locations.  E.g.
|> 	if ( StrEq( buffer, "EOF" ) ) ...

Yeah, I figured. I just thought I'd give it a try.

			-Rob Konigsberg

scs@adam.mit.edu (Steve Summit) (03/20/91)

In article <1991Mar19.034802.24340@cbnewsk.att.com> barton@cbnewsk.att.com (jim.m.barton) writes:
>A more fruitful approach might be to examine your data more closely to
>see if some of the strcmp()'s could be avoided. For example, if many
>of the string's being compared are actually part a restricted set of
>values rather than arbitrary strings, you might be able to represent them
>internally as enumerated values, (where integer comparisons could be done)...

Indeed.  When it's important to do so, reducing the number of
explicit string comparisons is not hard, even when the strings
are neither restricted nor known in advance.  Hashing is a very
powerful and surprisingly easy technique to use.

I was just writing a diff-like program the other day.  By
representing strings not with individual char *'s, but rather
with instances of this structure:

	struct chunk
		{
		char *c_ptr;
		unsigned int c_hash;
		};

, and by keeping the c_hash field filled in with a hash value
derived from the string pointed to by c_ptr, I could compare
strings for equality with this macro:

	#define Chunkeq(cp1, cp2) ((cp1)->c_hash == (cp2)->c_hash && \
				strcmp((cp1)->c_ptr, (cp2)->c_ptr) == 0)

When you have a lot of strings to compare, the time spent
computing hash values, and the extra space to store them,
can be well worth it.
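Any cheap hash will do for filling in c_hash; one common
multiply-and-add shape (illustrative only -- no claim that it's the
"best" one, and the name hashstr is mine) is:

```c
/* One cheap multiply-and-add string hash for filling in c_hash.
 * Illustrative only -- no claim that this is the best hash. */
unsigned int hashstr(const char *s)
{
	unsigned int h = 0;

	while (*s != '\0')
		h = h * 31 + (unsigned char) *s++;
	return h;
}
```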

You can also write great little symbol table implementations
using hashing, and the hash values don't even have to take up
extra space (they're implicit in the bucket or slot indices).

(These techniques are closely related to the "string pool"
modules mentioned by Doug and Chris.  In fact, a string pool
would almost certainly be written using hashing for string
insertion.)

See any algorithms book for much more on hashing; it's got
nothing particularly to do with C.  (Please don't ask "what's
the best string hash function?", which is an FAQ, though not on
the list yet.)

                                            Steve Summit
                                            scs@adam.mit.edu

marc@arnor.UUCP (Marc Auslander) (03/20/91)

A correct compiler cannot inline strcmp unless it is given
extralingual permission to do so, for example via a pragma, by
defining some reserved symbol, or by some other means.  This is because
it is legal to combine a program which uses strcmp with another which
defines it to be anything at all!

For example, the RISC System/6000 xlc compiler inlines __strcmp and
__strcpy.  (In ANSI C, it's OK for the compiler to do funny things with
symbols which start with "__".)  Then, string.h contains defines of
strcmp and strcpy to the __ versions, which take effect if __STR__ is
defined.  From the user's point of view, you define __STR__ if you want
inline string functions.  One user approach is to add -D__STR__ to the
CFLAGS in the makefile.

(If you're going to the trouble of inlining, it's a good idea to do it
in a way which lets you get the advantage in benchmarks!)

--


Marc Auslander       <marc@ibm.com>

flint@gistdev.gist.com (Flint Pellett) (03/21/91)

The original poster apparently is only interested in an equal/not equal
result from strcmp(), not the before/equal/after result that strcmp()
actually returns.  I'd like to expand the question a bit: I'm curious if
anyone has a method that actually speeds up strcmp() in the situation
where you have to get the lexicographic ordering information back.  (Such
as when the strcmp() is used as the basis for comparison in a bsearch().)
-- 
Flint Pellett, Global Information Systems Technology, Inc.
1800 Woodfield Drive, Savoy, IL  61874     (217) 352-1165
uunet!gistdev!flint or flint@gistdev.gist.com

henry@zoo.toronto.edu (Henry Spencer) (03/21/91)

In article <1991Mar19.034802.24340@cbnewsk.att.com> barton@cbnewsk.att.com (jim.m.barton) writes:
>... The above macro, StrEq(), MIGHT be faster if most
>of the strings being compared tend to differ in their first character...
>IMHO, I doubt that StrEq() will acheive any significant optimization for
>MOST cases...

Sorry, there is considerable experimental evidence that it does.  The fact
is, most string comparisons fail on the very first character, and even a
compiler-inlined strcmp() -- not all that common in the Unix world yet,
thanks to our geriatric compilers -- is generally a more expensive way
of discovering this.
-- 
"[Some people] positively *wish* to     | Henry Spencer @ U of Toronto Zoology
believe ill of the modern world."-R.Peto|  henry@zoo.toronto.edu  utzoo!henry

henry@zoo.toronto.edu (Henry Spencer) (03/21/91)

In article <1991Mar16.211154.5568@cc.helsinki.fi> wirzenius@cc.helsinki.fi (Lars Wirzenius) writes:
>For non-portability, the fastest way would probably be to write inline
>assembly code using the machine's string compare instruction...

Actually, not necessarily so.  Even on machines which have a string-compare
instruction, that instruction is often relatively slow.  It can be quicker
to do it yourself.

There are also some fiendish tricks that can be played to considerably beat
the performance of the obvious compare loop, and of any string-compare
instruction which basically just implements that loop.  However, you'll
have to wait a bit longer to hear about them, since the summer Usenix
conference rejected my paper...
-- 
"[Some people] positively *wish* to     | Henry Spencer @ U of Toronto Zoology
believe ill of the modern world."-R.Peto|  henry@zoo.toronto.edu  utzoo!henry

gwyn@smoke.brl.mil (Doug Gwyn) (03/21/91)

In article <11111@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
-Of course, you can build yourself a `string pool' system, in which each
-distinct string appears exactly once, and then two strings match iff their
-pointers match . . . but this merely offloads the `compare for truly
-equal' into the string pool system.
-If you need to do many compares, this is probably worthwhile anyway.

Gee, great minds think alike!

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/21/91)

In article <1991Mar20.174452.4246@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>In article <1991Mar19.034802.24340@cbnewsk.att.com> barton@cbnewsk.att.com (jim.m.barton) writes:
>>... The above macro, StrEq(), MIGHT be faster if most
>>of the strings being compared tend to differ in their first character...
>>IMHO, I doubt that StrEq() will achieve any significant optimization for
>>MOST cases...
>Sorry, there is considerable experimental evidence that it does.  The fact
>is, most string comparisons fail on the very first character, and even a

An easy method for verifying that two strings typically differ in
their first character is to run the following (dumb) sorting program.
For the first 10k words out of /usr/dict/words, scrambled up in
essentially random order, it says 89% of the comparisons failed on the
first character.

This is not much of a test, granted, and there _will_ be some cases
(e.g. sorting nearly-sorted lists of words) where comparisons _won't_
be decided by the first characters.  However, even in these latter
cases the overhead of an extra `compare' versus the overhead of the
strcmp loop will be small -- please feel free to verify this
experimentally.

(As a back-of-the-envelope calculation, say it costs S instructions
for the strcmp code and C instructions for the compare; then on average
we will perform C + .11 S instructions.  We can then ask how large C
would have to be to make it worth using only S.  We find break-even at
C = .89 S.  Therefore, provided the compare sequence (including jumps
and register loads) on your machine costs less than 89% of a (typical)
call/inline of strcmp, USE THE COMPARE.)

As a slight side note there are certain contexts where string comparisons
can be made MUCH more efficient than a character-by-character compare.

The well-known `string search' algorithms that essentially compare
_rare_ characters within each string (e.g. if one string contains a `z'
then first search for `z' in the other string -- if found then match the
rest) & then adjust pointers by an appropriate amount (rather than just 
incrementing them) depending on the outcome of each character comparison is 
one example. 
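A much-simplified sketch of that skip-ahead idea (in the style of
Boyer-Moore/Horspool, not the exact algorithm from any particular
paper; the name find_pat is mine):

```c
#include <string.h>
#include <stddef.h>

/* Simplified Horspool-style sketch: on a mismatch, the bad-character
 * table lets the pattern slide forward by more than one position, so
 * most text bytes are never examined at all. */
const char *find_pat(const char *text, const char *pat)
{
	size_t skip[256], i, pos;
	size_t tlen = strlen(text), plen = strlen(pat);

	if (plen == 0 || plen > tlen)
		return plen == 0 ? text : NULL;
	for (i = 0; i < 256; i++)
		skip[i] = plen;			/* default: jump the whole pattern */
	for (i = 0; i + 1 < plen; i++)
		skip[(unsigned char) pat[i]] = plen - 1 - i;

	for (pos = 0; pos + plen <= tlen;
	     pos += skip[(unsigned char) text[pos + plen - 1]])
		if (memcmp(text + pos, pat, plen) == 0)
			return text + pos;
	return NULL;
}
```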

I remember reading an article in Software Practice and Experience
that compared such a statistical method with string-matching
INSTRUCTIONS (I _think_ it was on an IBM 370, so that dates me, doesn't
it?), and the algorithm was found to be SUBSTANTIALLY faster (like an
order of magnitude, unless I'm mistaken) than the hardware string
compare.

-kym

/* Dumb sort program that figures out how many strings (mis)match on 
first character -- it says 11% match for /usr/dict/words. */
#include <stdio.h>
#include <stdlib.h>	/* for malloc() and rand() */
#include <string.h>

#define MAXWORD 10000

char *word[MAXWORD];
int nw=0;
int nmatch1=0;
int ncmp=0;

main(){
	init();
	scramble();
	sort();
	printf("fraction match first char=%g\n",(double)nmatch1/ncmp);
	}

init() {
	int i;
	char buf[100];
	for(i=0;i<MAXWORD;i++){
		fgets(buf,sizeof buf,stdin);	/* gets() can overrun buf */
		buf[strcspn(buf,"\n")]='\0';
		word[i]=strcpy(malloc(strlen(buf)+1),buf);
		}
	}

scramble() {
	int i;
	for(i=0;i<MAXWORD;i++){
		int j=rand()%MAXWORD;
		char *tmp=word[i];
		word[i]=word[j];
		word[j]=tmp;
		}
	}

sort() {
	int i,j,k;
	char *tmp;
	for(i=0;i<MAXWORD;i++){
		j=i;
		tmp=word[j];
		for(k=j+1;k<MAXWORD;k++)
		if(cmp(word[k],tmp)<0) j=k,tmp=word[j];
		word[j]=word[i];
		word[i]=tmp;
		}
	}

cmp(s1,s2) char *s1,*s2; {
	++ncmp;
	if(*s1==*s2) ++nmatch1;
	return strcmp(s1,s2);
	}

psmith@iies.ecn.purdue.edu (Paul F Smith) (03/21/91)

>In article <1991Mar20.020957.9180@athena.mit.edu> scs@adam.mit.edu (Steve Summit) writes:
>
>	struct chunk
>		{
>		char *c_ptr;
>		unsigned int c_hash;
>		};
>
...
>
>You can also write great little symbol table implementations
>using hashing, and the hash values don't even have to take up
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>extra space (they're implicit in the bucket or slot indices).
 ^^^^^^^^^^^!!!
>
>                                            Steve Summit
>                                            scs@adam.mit.edu

OK, I won't ask about a string hash function :-), but could you
please elaborate on how your symbol table works?  I just implemented a
hash table that keeps the hash value around (like your struct above),
which I then use when scanning the list at the given bucket to minimize
strcmp()s.  But I can't see how you can get rid of it and still reduce
the number of strcmp()s.  Please describe.  Thanks!

--
------------------------------------------------------------------
Paul F. Smith - ADPC - Purdue University     psmith@ecn.purdue.edu
<someday I'll think of something profound to put here, maybe>

scs@adam.mit.edu (Steve Summit) (03/21/91)

In article <1991Mar20.020957.9180@athena.mit.edu>, I wrote:
>You can also write great little symbol table implementations
>using hashing, and the hash values don't even have to take up
>extra space (they're implicit in the bucket or slot indices).

In article <1991Mar20.213615.11223@noose.ecn.purdue.edu> psmith@iies.ecn.purdue.edu (Paul F Smith) writes:
>...could you 
>please elaborate on how your symbol table works. I just implemented a
>hash table that keeps the hash value around (like your struct above), 
>which I then use in scanning the list at the given bucket to minimize
>strcmp()s.  But, I can't see how you can get rid of it and still reduce 
>the number of strcmp()s. Please describe, Thanks!

Well, I didn't say the number of strcmp's was reduced to 1.
Hashing has done its work (i.e. eliminated a lot of strcmp's and
associated searching) when it has computed the right bucket to
look in; there aren't supposed to be large numbers of collisions
to scan over.  I just call strcmp for each entry in the bucket.
The code is tiny, so I'll show it:

struct symtabent
	{
	char *st_name;
	struct type *st_type;
	somethingorother st_value;
	struct symtabent *st_next;
	};

struct symtab
	{
	int st_hashsize;
	struct symtabent *st_hashtab[1];	/* dynamically extended */
			/* ("unwarranted chumminess with the compiler") */
	};

stinsert(step, stp)
struct symtabent *step;
struct symtab *stp;
{
unsigned int h = hash(step->st_name) % stp->st_hashsize;
/* simpleminded; doesn't check for duplicates */
step->st_next = stp->st_hashtab[h];
stp->st_hashtab[h] = step;
}

struct symtabent *
stlookup(name, stp)
char *name;
struct symtab *stp;
{
unsigned int h = hash(name) % stp->st_hashsize;
register struct symtabent *step;

for(step = stp->st_hashtab[h]; step != NULL; step = step->st_next)
	{
	if(strcmp(step->st_name, name) == 0)
		return step;
	}

return NULL;
}

                                            Steve Summit
                                            scs@adam.mit.edu

gwyn@smoke.brl.mil (Doug Gwyn) (03/21/91)

In article <MARC.91Mar20080239@marc.watson.ibm.com> marc@arnor.UUCP (Marc Auslander) writes:
>A correct compiler cannot inline strcmp unless it is given
>extralingual permission to do so - for example in a pragma or by
>defining some reserved symbol or by some other means.  This is because
>it is legal to combine a program which uses strcmp with another which
>defines it to be anything at all!

Conforming implementations ARE allowed to in-line the standard library
functions.  They must also provide external functions for most of them,
in case some application decides it needs a pointer to one of those
functions, etc.

It is true that a strictly conforming program is allowed to #define
strcmp, under certain circumstances, but that is not an issue.  It is
the use of "strcmp" as an identifier with external linkage that an
in-lining compiler must key on.

It is also possible for <string.h> to #define strcmp __strcmp and have
the compiler in-line __strcmp instead of strcmp.  This may make it
slightly easier for the implementation to support access to the
external-function definition of strcmp mentioned in my first paragraph.
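Schematically, the arrangement might look like this (a sketch only;
my_strcmp_inline stands in for the implementation's reserved __strcmp,
which user code must not define itself, and the instrumentation
counter is just for demonstration):

```c
/* Schematic sketch of the <string.h> redirection described above.
 * A real compiler would expand the reserved name in line, and a
 * real <string.h> would carry the #define. */
static int inline_calls = 0;		/* instrumentation for the demo only */

int my_strcmp_inline(const char *a, const char *b)
{
	++inline_calls;
	while (*a != '\0' && *a == *b)
		++a, ++b;
	return (unsigned char) *a - (unsigned char) *b;
}

#define STR_INLINE			/* plays the role of __STR__ */
#ifdef STR_INLINE
#define strcmp(a, b) my_strcmp_inline(a, b)
#endif
```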

sef@kithrup.COM (Sean Eric Fagan) (03/21/91)

In article <15537@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:
>It is also possible for <string.h> to #define strcmp __strcmp and have
>the compiler in-line __strcmp instead of strcmp.  This may make it
>slightly easier for the implementation to support access to the
>external-function definition of strcmp mentioned in my first paragraph.

ANSI C still has the requirement that

	#undef	strcmp

(after including <string.h>) give the user the version from the library,
doesn't it?  (A footnote in one of the sections; I don't have my copy with
me, so I can't give explicit details.)  In any event, that footnote seemed
to indicate pretty well how the committee thought intrinsics should be
handled...

-- 
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.

rh@smds.UUCP (Richard Harter) (03/21/91)

In article <11111@dog.ee.lbl.gov>, torek@elf.ee.lbl.gov (Chris Torek) writes:
> In article <15510@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:
> >... in the vast majority of applications the strings being
> >compared, even in cases where they match, are contained in different
> >storage locations.  E.g.
> >	if ( StrEq( buffer, "EOF" ) ) ...

> Of course, you can build yourself a `string pool' system, in which each
> distinct string appears exactly once, and then two strings match iff their
> pointers match . . . but this merely offloads the `compare for truly
> equal' into the string pool system.

Just so.  In one of our programs where there was a great deal of manipulation
of strings we did a string "object" package (C not C++).  For each unique
string under management there is a struct which looks something like

	struct string_object {
		char	*c;		/* The string itself	*/
		int	 n;		/* Its length		*/
		int	 rc;		/* Reference count	*/
		int	 sz;		/* Allocated space for string */
		struct string_object *link;	/* Data access	*/
		};

The pool used a hash table with short buckets (the link field).  In the
case in question there are many more references to existing strings (and
multiple instances) than there are references to new strings, so using
pointers into the pool is a big win.  The only object primitives are
add_string and delete_string.  Delete_string decrements a reference
count and does the appropriate cleanup when the count goes to zero.
Add_string
has to do the compare, which is more expensive than a simple compare
because it has to compute the hash index (which only needs to be very
quick and dirty).  If the string exists the reference count is upped;
otherwise a new object is created (actually taken off a free list.)
The size field is there because the strings are copied into the struct.
This is more overhead, but it is worth it in this case, because the
strings in question are often substrings.  The upshot is that there is
very little allocation/deallocation which is much more expensive than
string compares.
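A sketch of the two primitives Richard describes, with details he
doesn't spell out filled in by guesswork (bucket count, hash function,
plain malloc instead of his free list; add_string takes an explicit
length since his strings are often substrings):

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKET 256

struct string_object {
	char	*c;		/* The string itself	*/
	int	 n;		/* Its length		*/
	int	 rc;		/* Reference count	*/
	int	 sz;		/* Allocated space for string */
	struct string_object *link;	/* Hash-bucket chain */
};

static struct string_object *bucket[NBUCKET];

/* quick and dirty, which Richard says is all it needs to be */
static unsigned hash(const char *s, int n)
{
	unsigned h = 0;
	while (n-- > 0)
		h = h * 31 + (unsigned char)*s++;
	return h % NBUCKET;
}

struct string_object *add_string(const char *s, int n)
{
	unsigned h = hash(s, n);
	struct string_object *p;

	for (p = bucket[h]; p != NULL; p = p->link)
		if (p->n == n && memcmp(p->c, s, n) == 0) {
			p->rc++;	/* existing string: up the count */
			return p;
		}
	p = malloc(sizeof *p);		/* (a free list in the real code;
					 * error checks omitted here) */
	p->sz = n + 1;
	p->c = malloc(p->sz);		/* strings are copied in */
	memcpy(p->c, s, n);
	p->c[n] = '\0';
	p->n = n;
	p->rc = 1;
	p->link = bucket[h];
	bucket[h] = p;
	return p;
}

void delete_string(struct string_object *p)
{
	if (--p->rc == 0) {
		struct string_object **q = &bucket[hash(p->c, p->n)];
		while (*q != p)
			q = &(*q)->link;
		*q = p->link;		/* unlink and reclaim */
		free(p->c);
		free(p);
	}
}
```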

It would be interesting to hear how other people have handled this
sort of problem.
-- 
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh Phone: 508-369-7398 
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb.  This sentence short.  This signature done.

psmith@iies.ecn.purdue.edu (Paul F Smith) (03/22/91)

In article <1991Mar20.235613.20480@athena.mit.edu> scs@adam.mit.edu writes:
>In article <1991Mar20.213615.11223@noose.ecn.purdue.edu> psmith@iies.ecn.purdue.edu (Paul F Smith) writes:
>>...could you 
>>please elaborate on how your symbol table works. I just implemented a
>>hash table that keeps the hash value around (like your struct above), 
>>which I then use in scanning the list at the given bucket to minimize
>>strcmp()s.  But, I can't see how you can get rid of it and still reduce 
>>the number of strcmp()s. Please describe, Thanks!
>
>Well, I didn't say the number of strcmp's was reduced to 1.
>Hashing has done its work (i.e. eliminated a lot of strcmp's and
>associated searching) when it has computed the right bucket to
>look in; there aren't supposed to be large numbers of collisions
>to scan over.  

Ok. You were just talking about the normal effect of a hash table.
I used a hash table, but for each entry I keep the full hash value
to try and keep the # of strcmp()s close to 1. Like this...

>
>for(step = stp->st_hashtab[h]; step != NULL; step = step->st_next)
>	{
>	if(strcmp(step->st_name, name) == 0)

Instead I do:
(with hv = hash(name) computed once at the beginning)

if((step->hashval == hv) && strcmp(step->st_name, name) == 0)

>		return step;
>	}
>

I thought you had achieved this property without keeping the hash value
with
the symbol entry. (wouldn't that be nice!)

I think this reduction in strcmp()s is worth the space. Does anyone
know of a reason that it isn't? (other than not having enough space :-)
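A sketch of the lookup Paul describes, with the full hash value stored
per entry and computed once per lookup (struct and function names are
made up here, not his actual code):

```c
#include <string.h>

struct symtab_entry {
	char *st_name;
	unsigned hashval;	/* full hash, stored with the entry */
	struct symtab_entry *st_next;
};

static unsigned hash(const char *s)
{
	unsigned h = 0;
	while (*s)
		h = h * 31 + (unsigned char)*s++;
	return h;
}

/* Scan the chain for name's bucket; the cheap hashval comparison
 * rejects most non-matches before strcmp() ever runs. */
struct symtab_entry *lookup(struct symtab_entry **buckets,
			    unsigned nbucket, const char *name)
{
	unsigned hv = hash(name);	/* computed once, as Paul notes */
	struct symtab_entry *step;

	for (step = buckets[hv % nbucket]; step != NULL;
	     step = step->st_next)
		if (step->hashval == hv &&
		    strcmp(step->st_name, name) == 0)
			return step;
	return NULL;
}
```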

Thanks for the reply!

--
------------------------------------------------------------------
Paul F. Smith - ADPC - Purdue University     psmith@ecn.purdue.edu
<someday I'll think of something profound to put here, maybe>

willcr@bud.sos.ivy.isc.com (Will Crowder) (03/22/91)

In article <1991Mar20.213615.11223@noose.ecn.purdue.edu>, psmith@iies.ecn.purdue.edu (Paul F Smith) writes:
|> >In article barton@cbnewsk.att.com (jim.m.barton) writes:
|> >
|> >	struct chunk
|> >		{
|> >		char *c_ptr;
|> >		unsigned int c_hash;
|> >		};
|> >
|> ...
|> >
|> >You can also write great little symbol table implementations
|> >using hashing, and the hash values don't even have to take up
|>                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|> >extra space (they're implicit in the bucket or slot indices).
|>  ^^^^^^^^^^^!!!
|> >
|> >                                            Steve Summit
|> >                                            scs@adam.mit.edu
|> 
|> OK, I won't ask about a string hash function :-), but could you 
|> please elaborate on how your symbol table works. I just implemented a
|> hash table that keeps the hash value around (like your struct above), 
|> which I then use in scanning the list at the given bucket to minimize
|> strcmp()s.  But, I can't see how you can get rid of it and still reduce 
|> the number of strcmp()s. Please describe, Thanks!

(Since we're talking about one of my favorite data structures, hashing
with open chaining, I'll take a shot at this one.)

Hashing with open chaining is a method of saving information such that it
can be looked up quickly.  The basic idea is to deal with hash collisions
by maintaining a list (usually a linked-list) of data with identical hash
values.  Then each hash "bucket" points to the list of data having the
same hash value.  To look up a key, the key is hashed and the chain having
that hash value is scanned for the data.  Since all data in a given chain
have the same hash value, there is no need to save the hash value anywhere.

One optimization might be to further hash the data in each chain, obviously 
with a different hashing algorithm, and then rehash the key and scan
the chain for data having the same new hash value.  In this case,
you would have to save the new hash value along with the data.  You could
also have chains on top of chains, etc., etc., but a simple single-level
algorithm with one set of buckets is usually efficient enough.

|> ------------------------------------------------------------------
|> Paul F. Smith - ADPC - Purdue University     psmith@ecn.purdue.edu
|> <someday I'll think of something profound to put here, maybe>

Hope this helps,

Will

--------------------------------------------------------------------------------
Will Crowder, MTS            | "I belong to no organized political party.  
(willcr@ivy.isc.com)         |  I'm a democrat."  
INTERACTIVE Systems Corp.    |		-- Will Rogers

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (03/22/91)

In article <1151@gistdev.gist.com> flint@gistdev.gist.com (Flint Pellett) writes:
> I'm curious if
> anyone has a method that actually speeds up strcmp() in the situation
> where you have to get the lexicographic ordering information back.

Unrolling the loop produces good results on many machines. You can also
zoom through the strings comparing words, then compare characters if
necessary (i.e., when your byte order is backwards) to produce a result.
It works best if you know the string lengths beforehand and have each
string starting on a word boundary. Piercarlo can explain a lot more.
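A sketch of the first suggestion, a strcmp unrolled four bytes per
iteration (strcmp4 is a made-up name); unlike word-at-a-time
comparison, this assumes nothing about alignment or byte order:

```c
/* strcmp unrolled four bytes per iteration; on many machines the
 * reduced loop overhead is a measurable win. */
int strcmp4(const char *s, const char *t)
{
	for (;;) {
		if (s[0] != t[0] || s[0] == '\0')
			return (unsigned char)s[0] - (unsigned char)t[0];
		if (s[1] != t[1] || s[1] == '\0')
			return (unsigned char)s[1] - (unsigned char)t[1];
		if (s[2] != t[2] || s[2] == '\0')
			return (unsigned char)s[2] - (unsigned char)t[2];
		if (s[3] != t[3] || s[3] == '\0')
			return (unsigned char)s[3] - (unsigned char)t[3];
		s += 4;
		t += 4;
	}
}
```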

---Dan

henry@zoo.toronto.edu (Henry Spencer) (03/22/91)

In article <MARC.91Mar20080239@marc.watson.ibm.com> marc@arnor.UUCP (Marc Auslander) writes:
>A correct compiler cannot inline strcmp unless it is given
>extralingual permission to do so - for example in a pragma or by
>defining some reserved symbol or by some other means.  This is because
>it is legal to combine a program which uses strcmp with another which
>defines it to be anything at all!

In ANSI C it is *not* legal for a program to redefine the standard
library functions (not as external identifiers anyway), so the inlining
is, subject to some care about fine points, legitimate.
-- 
"[Some people] positively *wish* to     | Henry Spencer @ U of Toronto Zoology
believe ill of the modern world."-R.Peto|  henry@zoo.toronto.edu  utzoo!henry

torek@elf.ee.lbl.gov (Chris Torek) (03/22/91)

[This is a correction to a previous article, which should have been
stamped out before you read it, because of the Supersedes header, but I
suspect there are news sites out there that do not implement
Supersedes.  The correction is just an `='-for-`-' typo caused by
runaway fingers.  Thanks to David Barnett for catching it early.]

In article <1151@gistdev.gist.com> flint@gistdev.gist.com
(Flint Pellett) writes:
>... I'm curious if anyone has a method that actually speeds up strcmp()
>in the situation where you have to get the lexicographic ordering
>information back.

For a particular application (my own personal version of Emacs, actually),
I found that comparing the first two characters `inline' gave the best
results.

/*
 * Find an entry in a table.  Create should be true iff we should create
 * a new entry if the name is not in the table.  The names of any new
 * entries are saved with savestr().
 */
struct tentry *
FindEntry(t, name, create)
	struct table *t;
	register char *name;
	int create;
{
	register struct tentry **ents, *te;
	register int i, h, m, l;

	if (name == NULL)
		return (NULL);
	if (t->t_sorted <= 0)
		SortEntries(t);
	ents = t->t_ents;
	/* binary search */
	h = t->t_size - 1;
	l = 0;
	while (h >= l) {
		te = ents[m = (h + l) >> 1];
		/* first two characters done inline for speed */
		if ((i = te->te_name[0] - name[0]) == 0 && name[0] != 0) {
			if ((i = te->te_name[1] - name[1]) == 0)
				i = strcmp(te->te_name + 1, name + 1);
		}
		if (i > 0)
			h = m - 1;
		else if (i < 0)
			l = m + 1;
		else		/* found */
			return (te);
	}
	if (create) {
		if ((name = savestr(name)) != NULL &&
		    (te = insert(t, name, &ents[l])) != NULL)
			return (te);
		if (name != NULL)
			free(name);
		error("out of memory allocating entry in %s", t->t_name);
	}
	return (NULL);
}

SortEntries uses a clever trick: it does an insertion sort if the
table is `almost' sorted, otherwise it does a quicksort.
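The trick might look something like this (the inversion-count
threshold and the use of qsort() are my guesses; Chris's code sorts
struct tentry pointers, plain strings are used here for brevity):

```c
#include <stdlib.h>
#include <string.h>

static int bystr(const void *a, const void *b)
{
	return strcmp(*(char *const *)a, *(char *const *)b);
}

/* Count out-of-order neighbours; insertion sort is near-linear on
 * `almost' sorted input, quicksort handles the general case. */
void sort_names(char **v, int n)
{
	int i, inversions = 0;

	for (i = 1; i < n; i++)
		if (strcmp(v[i - 1], v[i]) > 0)
			inversions++;
	if (inversions <= n / 8) {	/* `almost' sorted */
		for (i = 1; i < n; i++) {
			char *key = v[i];
			int j = i;
			while (j > 0 && strcmp(v[j - 1], key) > 0) {
				v[j] = v[j - 1];
				j--;
			}
			v[j] = key;
		}
	} else
		qsort(v, n, sizeof *v, bystr);
}
```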

Much of this speed-hacking is due to *measured* results showing that my
Emacs spent most of its `startup' time building tables and reading the
current directory.
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

xwang@gmuvax2.gmu.edu (Xiang-Min Wang) (03/24/91)

In article <1991Mar18.174207.7377@bingvaxu.cc.binghamton.edu> consp06@bingsuns.cc.binghamton.edu (Robert Konigsberg) writes:
>Somebody said part of the problem with string comparisons is that if
>most of the strings are equivalent, there will be a lot of
>processor time.  Wouldn't it be good then, to include in the macro,
>something to compare the actual POINTERS?  If the pointers are the
>same then the two strings have no CHOICE but to be equivalent.  This
>would really cut down the time under certain situations.
>
>			-Rob Konigsberg


This helps only in limited situations, and it can run into the aliasing
problem: two different pointers may still point to equal strings, so an
unequal pointer comparison decides nothing.
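Rob's check is cheap to fold into a macro, though; the pointer test can
only confirm equality, never deny it, so strcmp() still runs whenever
the pointers differ (STREQ is an illustrative name):

```c
#include <string.h>

/* Equal pointers imply equal strings; unequal pointers say nothing,
 * so fall through to strcmp() in that case. */
#define STREQ(a, b) ((a) == (b) || strcmp((a), (b)) == 0)
```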

xwang

browns@iccgcc.decnet.ab.com (Stan Brown) (03/26/91)

In article <MARC.91Mar20080239@marc.watson.ibm.com>, marc@arnor.UUCP (Marc Auslander) writes:
> A correct compiler cannot inline strcmp unless it is given
> extralingual permission to do so - for example in a pragma or by
> defining some reserved symbol or by some other means.  This is because
> it is legal to combine a program which uses strcmp with another which
> defines it to be anything at all!

Assuming that by 'correct compiler' you mean one that conforms to the 
ANSI standard, I have to disagree with your statement.

Sec 4.1.6 of the standard specifically allows compilers to inline
the standard library functions.

My opinions are mine:  I don't speak for any other person or company.
                   email (until 91/4/30): browns@iccgcc.decnet.ab.com
Stan Brown, Oak Road Systems, Cleveland, Ohio, USA    +1 216 371 0043