[alt.sources.d] Fast strcmp

poffen@sj.ate.slb.com (Russell Poffenberger) (09/28/90)

In article <OTTO.90Sep27145643@tukki.jyu.fi> otto@tukki.jyu.fi (Otto J. Makela) writes:
>In article <1646@cherry.edc.UUCP> fraser@edc.UUCP (Fraser Orr) writes:
>   In article <12145@crdgw1.crd.ge.com> larocque@jupiter.crd.ge.com (David M. LaRocque) writes:
>   >After I profiled my C program I discovered that the function
>   >strcmp() takes one third of my program's CPU time.  I was hoping
>   >someone may have written their own version of strcmp() that
>   >outperforms the library's function.
>
>   One quick dirty thing I did once was to change
>	   if (strcmp (a,b)==0)
>   to
>	   if (*a==*b && (strcmp(a.b)==0))
>
>   I seem to remember a remarkable performance improvement, like about 5
>   times faster.  Probably due to the fact that the program mainly did
>   strcmp and the strcmp was pretty bad.
>
>This should obviously be:
>	if(*a==*b && (strcmp(a,b)==0))
>(just in case...)
>
>I'd believe if your compiler is really brain-damaged, this could also help
>a teeny weeny bit:
>	if(*a==*b && (!strcmp(a,b)))

The reason for the performance increase is that, statistically, two random
strings will more often than not differ in their first character, so no
further checking is needed. In that common case, the test above avoids the
function call and the overhead associated with it.

Creating inline functions for strcmp-type operations (if your compiler
supports them) will help a lot by avoiding the function call entirely.
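
For instance, on a compiler without inline functions, a macro buys much the
same effect (STREQ is just a name made up for this illustration):

	#include <string.h>

	/* Test the first characters inline; pay for the strcmp() call
	 * only when they match.  Both arguments are evaluated more than
	 * once, so don't pass expressions with side effects. */
	#define STREQ(a, b) (*(a) == *(b) && strcmp((a), (b)) == 0)

so that the earlier test reads simply "if (STREQ(a, b))".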

Russ Poffenberger               DOMAIN: poffen@sj.ate.slb.com
Schlumberger Technologies       UUCP:   {uunet,decwrl,amdahl}!sjsca4!poffen
1601 Technology Drive		CIS:	72401,276
San Jose, Ca. 95110             (408)437-5254

karl@MorningStar.Com (Karl Fox) (10/05/90)

In article <CEDMAN.90Oct4171710@lynx.ps.uci.edu> cedman@lynx.ps.uci.edu (Carl Edman) writes:

   In article <1145@exodus.Eng.Sun.COM> falk@peregrine.Sun.COM (Ed Falk) writes:
      In article <6003@hplabsz.HPL.HP.COM> sartin@hplabs.hp.com (Rob Sartin) writes:
      Also, these two strings

	      "ab\0x"
	      "ab\0y"

      (where x and y are any garbage that happens to be in memory after the
      terminating '\0') will be evaluated as unequal.

It ain't necessarily so.  Any correct implementation will take into
account only the data actually in the string.  Saying "32-bit CRC"
does not imply reading the string 4 bytes at a time.

      There are workarounds for both problems, of course, but I think there
      won't be much of a performance improvement after you've done all it
      requires to get it right.

This is wrong too; when I added "string comparison hashing" to a diff
program, it more than doubled the speed.  For something like diff,
8-bit or 16-bit values are good enough, assuming an adequate hashing
function.
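
To make that concrete: a per-line hash along these lines never looks past
the terminator, and masking it down gives the sort of small value mentioned
above (the multiply-by-31 hash is an arbitrary choice for illustration, not
necessarily what any particular diff uses):

	/* Hash only the bytes actually in the string; whatever follows
	 * the terminating '\0' can never influence the result. */
	unsigned hash_line(const char *s)
	{
		unsigned h = 0;

		while (*s)
			h = h * 31 + (unsigned char)*s++;
		return h & 0xffff;	/* a 16-bit value is enough here */
	}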

   You can even imagine systems where it could be possible to remove
   the actual strings from memory, and simply assume that if the 32-bit
   CRCs match, the strings match.  Such systems would have to be
   tolerant of an occasional mismatch.

   If memory serves correctly, the above approach is used in some
   implementations of diff (only to give one practical, real-world
   example).

Diff still needs to compare the actual strings if the hash values
match.  Having a diff that is somewhat faster but "occasionally wrong"
wouldn't be too popular, I'd think.
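
In other words, the hash is safe only as a pre-filter in front of the
real comparison, something like this (invented names):

	#include <string.h>

	/* Compare the precomputed line hashes first; strcmp() runs only
	 * on a genuine match or the rare hash collision. */
	int lines_equal(unsigned ha, const char *a,
	                unsigned hb, const char *b)
	{
		return ha == hb && strcmp(a, b) == 0;
	}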
--
"I hear you guys deal with such dreck  |  Karl Fox, Morning Star Technologies
as SNA and X.25."       -Ed Vielmetti  |  karl@MorningStar.Com

cedman@lynx.ps.uci.edu (Carl Edman) (10/07/90)

In article <1990Oct5.122245.392@crappie.MorningStar.Com> karl@MorningStar.Com (Karl Fox) writes:
   I write:
      You can even imagine systems where it could be possible to remove
      the actual strings from memory, and simply assume that if the 32-bit
      CRCs match, the strings match.  Such systems would have to be
      tolerant of an occasional mismatch.

      If memory serves correctly, the above approach is used in some
      implementations of diff (only to give one practical, real-world
      example).

   Diff still needs to compare the actual strings if the hash values
   match.  Having a diff that is somewhat faster but "occasionally wrong"
   wouldn't be too popular, I'd think.

Now, if I remember correctly, that wasn't really a problem.  I haven't
got the source in front of me, but I remember it going something like
this (a rough C sketch follows the list):

    1. Read both files, calculate the CRC for each line, and then forget
       the files themselves.

    2. Run the standard diff algorithm on the CRCs alone.

    3. Check the correctness of the CRC comparison.
       If it is correct, fine.
       If it isn't, then:
           4. print a warning message, and
           5. add another diff command to the output to make up for the
              inequality of these two lines.
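
In C, the recovery step might look roughly like this; all the names are
invented, since I don't have the real source, and the lines are assumed
to have already been paired up by the CRC pass:

	#include <stdio.h>
	#include <string.h>

	/* Step 3 sketch: confirm that equal CRCs really meant equal
	 * text.  a[i] and b[i] hold the text of the i-th pair of lines
	 * the CRC pass treated as equal.  On a collision, warn and emit
	 * an extra change command to patch up the diff output. */
	void verify_pairs(char **a, char **b, int npairs)
	{
		int i;

		for (i = 0; i < npairs; i++)
			if (strcmp(a[i], b[i]) != 0) {
				fprintf(stderr,
				    "warning: CRC collision on line %d\n",
				    i + 1);
				printf("%dc%d\n< %s\n---\n> %s\n",
				    i + 1, i + 1, a[i], b[i]);
			}
	}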

This algorithm was quite fast, and it allowed the comparison of truly
gigantic files.  The only penalty you paid for an incorrect comparison
was a (possibly) suboptimal diff, but that was a small and relatively
infrequent problem, so you could still rely on the correctness of the
diff output.

	Carl Edman

Theoretical Physicist,N.:A physicist whose   | Send mail
existence is postulated to make the numbers  |  to
balance, but who is never actually observed  | cedman@golem.ps.uci.edu
in the laboratory.                           | edmanc@uciph0.ps.uci.edu