[comp.lang.c] strcmp

jk@cs.man.ac.uk (John Kewley ICL) (06/14/91)

Are there any  examples of source code for strcmp, I need to implement code for
a large variety of types in the style of strcmp and
would like to see what has been done before.

I know this seems like a simple school exercise, but I would like to see various
implementations.

e.g.

	intcmp : 
		((a>b)+(a>=b))-1
	or
		(a>=b>?(a>b):-1

	How much quicker is the second one likely to be? It also seems "nicer"
as well, adding two boolean values is not very "nice".

--
        J.K.
 
John M. Kewley, ICL, Wenlock Way, West Gorton, Manchester. M12 5DR
Tel:   (+44) 61 223 1301 X2138  Email: jk@cs.man.ac.uk / jk@nw.stl.stc.co.uk

yanek@panix.uucp (Yanek Martinson) (06/18/91)

For intcmp(a,b) why not just use a-b ? That will tell you if they are equal
and if not, which is greater.

For strcmp(s,t): while(*s++==*t++&&*s&&*t); return *s-*t;

henry@zoo.toronto.edu (Henry Spencer) (06/18/91)

In article <1991Jun18.074029.12226@panix.uucp> yanek@panix.uucp (Yanek Martinson) writes:
>For intcmp(a,b) why not just use a-b ? That will tell you if they are equal
>and if not, which is greater.

If the subtraction doesn't overflow, that is.  (The effects of overflow
are undefined, and strange things can happen even with old compilers.)

>For strcmp(s,t): while(*s++==*t++&&*s&&*t); return *s-*t;

Apart from being inefficient, this code is wrong, since the final comparison
must be done as if characters are unsigned.  (This is not precisely the rule
I would have picked, but at least ANSI C did pick something specific...)
That's aside from the fact that the final comparison is being done on the
two characters *after* the mismatch rather than on the mismatch itself...
-- 
"We're thinking about upgrading from    | Henry Spencer @ U of Toronto Zoology
SunOS 4.1.1 to SunOS 3.5."              |  henry@zoo.toronto.edu  utzoo!henry

torek@elf.ee.lbl.gov (Chris Torek) (06/19/91)

>In article <1991Jun18.074029.12226@panix.uucp>
>yanek@panix.uucp (Yanek Martinson) writes:
>>For strcmp(s,t): while(*s++==*t++&&*s&&*t); return *s-*t;

In article <1991Jun18.153653.1494@zoo.toronto.edu> henry@zoo.toronto.edu
(Henry Spencer) writes:
>Apart from being inefficient, this code is wrong, since the final comparison
>must be done as if characters are unsigned. ... That's aside from the fact
>that the final comparison is being done on the two characters *after* the
>mismatch rather than on the mismatch itself...

... which may not even exist (if the loop stopped on '\0').

The usual way to write strcmp, if one is not interested in working hard
---though one should be; a fast strcmp improves one's Dhrystone benchmarks,
which as we all know is more important than speeding up real programs :-)
---is this:

	int strcmp(char *s1, char *s2) {
		/*
		 * Loop until different, but stop if both are '\0'.
		 */
		while (*s1++ == *s2)
			if (*s2++ == '\0')
				return (0);
		/*
		 * *--s1 and *s2 differ.  One might be '\0'.
		 */
		return (*(unsigned char *)--s1 - *(unsigned char *)s2);
	}

although this assumes that the subtraction will not overflow.
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

thomasw@hpcupt1.cup.hp.com (Thomas Wang) (06/19/91)

> Are there any  examples of source code for strcmp, I need to implement
> code for a large variety of types in the style of strcmp and
> would like to see what has been done before.

> I know this seems like a simple school exercise, but I would like to see
> various implementations.

In my opinion, the +1, 0, and -1 comparisons are flawed, because they do not
take account unordered comparisons.  So I usually define 6 macros.

#define COMP_GREATER   (1)
#define COMP_EQUAL     (0)
#define COMP_LESS      (-1)
#define COMP_UNORDERED (MININT)

#define GE(x) ((x) >= COMP_EQUAL)  /* greater than or equal */
#define LE(x) (-(x) >= COMP_EQUAL) /* less than or equal, -MININT == MININT */

This system is compatible with strcmp(), yet able to deal with unordered
comparison.  For example, compare 1.0 against NaNs should return
COMP_UNORDERED.

int32 intcmp(int a, int b)
{
   return((a == b) ? COMP_EQUAL : ((a > b) ? COMP_GREATER : COMP_LESS));
}

 -Thomas Wang
              (Everything is an object.)
                                                     wang@hpdmsjlm.cup.hp.com
                                                     thomasw@hpcupt1.cup.hp.com

ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (06/19/91)

In article <1991Jun18.074029.12226@panix.uucp>, yanek@panix.uucp (Yanek Martinson) writes:
> For strcmp(s,t): while(*s++==*t++&&*s&&*t); return *s-*t;

Sorry, there's a mistake.  Let p = "\0followed by 50 million bytes of junk";
Then strcmp(p,p) will set s=p, t=p, find *s == *t == '\0', increment s and t,
and because the _next_ bytes aren't \0 it will keep going.  Not good.
In fact, if you call strcmp("\0a", "\0b") with that definition, it will
claim they _aren't_ equal, but they are both empty!

Keep it simple:
	while (*s == *t && *s != '\0' && *t != '\0')
	    s++, t++;
	return *s - *t;
Much as I hate to say this, a good compiler is likely to do just as well
from this as from anything cleverer.

-- 
Q:  What should I know about quicksort?   A:  That it is *slow*.
Q:  When should I use it?  A:  When you have only 256 words of main storage.

torek@elf.ee.lbl.gov (Chris Torek) (06/20/91)

In article <14421@dog.ee.lbl.gov> I wrote:
>	int strcmp(char *s1, char *s2) {

This should be

	int strcmp(const char *s1, const char *s2) {

Thanks to Matthew Farwell <dylan@ibmpcug.co.uk> for pointing this out.

>		return (*(unsigned char *)--s1 - *(unsigned char *)s2);
>... this assumes that the subtraction will not overflow.

The word `overflow' is wrong.  The subtraction cannot overflow.  There
are several cases:

	1. K&R C: `unsigned char's widen to `unsigned int's.
	   Then the subtraction is done in unsigned arithmetic,
	   which is modular arithmetic in 2^(number of bits).

	2. ANSI C:
	   2a:	sizeof(char) < sizeof(int).  Then `unsigned char's
		widen to signed `int's that are nonetheless
		nonnegative.  If the underlying system is 2's
		complement, 1's complement, or sign-magnitude, the
		subtraction will not overflow.  (If it is something
		else, I am not sure what happens.  With any luck,
		it does not overflow.)

	   2b:	sizeof(char) == sizeof(int).  Then `unsigned char's
		widen to `unsigned int's, and the subtraction is
		done in unsigned arithmetic, just as in K&R C.

(Note the complications in 2a, due to the so-called `value preserving'
rules, which are Wrong, but are now Engraved in Stone.  Oh well.)

The complication I was onsidering in <14421@dog.ee.lbl.gov> occurs
in case 2b.  Suppose, for instance, that char and int are both 16
bits, and that we have two strings made up of characters (32761,0)
and (1,0) respectively.  Then the comparison will return

	(int)((unsigned char)32761 - (unsigned char)1)

(the `int' cast is provided by the `return').  This will be equal to

	(int)((unsigned)32761 - (unsigned)1)

or

	(int)((unsigned)32760)

or, in 2's complement, -8.  A return value of -8 says that s1 < s2, yet
32761 > 1.  This is what I called `overflow' above.  I am not sure *what*
to call it, but `overflow' is wrong.

Thanks to Lasse H. Ostergaard <lhoe@id.dth.dk> for noting this.
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

pdsmith@bbn.com (Peter D. Smith) (06/20/91)

In article <1991Jun18.074029.12226@panix.uucp> yanek@panix.uucp (Yanek Martinson) writes:
>For intcmp(a,b) why not just use a-b ? That will tell you if they are equal
>and if not, which is greater.
>
>For strcmp(s,t): while(*s++==*t++&&*s&&*t); return *s-*t;

a-b will NOT tell you which is larger.  Instead it will overflow and
either return the wrong answer or call the fatal-error trap and
abruptly end the experiment.

					Peter D. Smith

drh@duke.cs.duke.edu (D. Richard Hipp) (06/20/91)

A recent thread on this newsgroup has devoted a lot of bandwidth to
discussions of how to implement a fast "strcmp".  I suppose doing
strcmp quickly is necessary on occasions (such as running benchmarks :-)),
but more often than a quick strcmp, I need a *sensible* strcmp.  That
is to say, I need a strcmp that used in conjuction with sort will put
strings in what I would call "correct" alphabetical order.  Strcmp
fails at this task for two principle reasons:  1)  It places "BBB" before
"aaa".  2) It places "x10" before "x9".  Both of these decisions are
wrong, IMHO.

I have, on various occasions, implemented my own string comparison
routines which attempt to address the above deficiencies in strcmp.
(One such implementation, strpbcmp -- string compare in PhoneBook order,
is attached.)  I am not especially pleased with any of these routines,
however.  Somehow, to my mind, they like that special sense of elegance
and grace that a truly great string comparison routine should have.
I therefore request option from the net on what others think is the
one right, true, and proper way to compare strings.  Side issues that
you may wish to comment upon are 1) what do you call this great new
string comparison function?  ("strcmp" is already taken) and 2) how
might you implement it efficiently?

Responses by email to drh@cs.duke.edu will be summarized to this newsgroup
after an appropriate time interval.


Appendix:  a candidate string comparison function.
------------------------------ cut here ----------------------------------
#include <ctype.h>
/*
** Compare strings in phonebook order
*/
int strpbcmp(a,b)
char *a,*b;
{
  register int ca, cb;
  int ai, bi, cnt = 0;
  int bias = 0;

#ifdef TRACE
  printf("Comparing \"%s\" to \"%s\" yields ",a,b);
#endif
  ca = *(a++);
  cb = *(b++);
  while( ca && cb ){
    if( bias==0 ){
      if( isupper(ca) ){ ca = tolower(ca); bias--; }
      if( isupper(cb) ){ cb = tolower(cb); bias++; }
    }else{
      if( isupper(ca) ){ ca = tolower(ca); }
      if( isupper(cb) ){ cb = tolower(cb); }
    }
    if( isdigit(ca) ){
      if( cnt-->0 ){
        if( cb!=ca ) break;
      }else{
        if( !isdigit(cb) ) break;
        for(ai=0; isdigit(a[ai]); ai++);
        for(bi=0; isdigit(b[bi]); bi++);
        if( ai<bi ){ ca=0; break; }
        if( bi<ai ){ cb=0; break; }
        if( ca!=cb ) break;
        cnt = ai;
      }
    }else if( ca!=cb ){
      break;
    }
    ca = *(a++);
    cb = *(b++);
  }
  if( ca==cb ) ca += bias;
#ifdef TRACE
  printf("%d\n",ca-cb);
#endif
  return ca-cb;
}


#ifdef TEST

#define LINESIZE 40
#define LINECNT  2000
#include <stdio.h>

void main(void){
  char x[LINECNT][LINESIZE];
  int i,j;

  for(i=0; i<LINECNT && fgets(x[i],LINESIZE,stdin); i++);
  qsort(x,i,sizeof(char[LINESIZE]),strpbcmp);
  for(j=0; j<i; j++) fputs(x[j],stdout);
}
#endif

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/21/91)

In article <14498@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
> >		return (*(unsigned char *)--s1 - *(unsigned char *)s2);
> >... this assumes that the subtraction will not overflow.
> The word `overflow' is wrong.  The subtraction cannot overflow.  There
> are several cases:
  [ discussion ]
> Suppose, for instance, that char and int are both 16
> bits, and that we have two strings made up of characters (32761,0)
> and (1,0) respectively.

Wait a minute. First of all, if unsigned chars range from 0 to 65535,
then subtracting two unsigned char values will give a range of -65535 to
-1 for s1 < s2, 0 for s1 == s2, and 1 to 65535 for s1 > s2. There's no
way these 131071 values can fit into a 16-bit int without further
processing.

On the other hand, how can sizeof(char) == sizeof(int)? I was under the
impression that sizeof(char) had to be smaller than sizeof(int).
Otherwise the getchar() interface explodes. If sizeof(char) <
sizeof(int) and both are some number of bits, then the unsigned char
subtraction will fit into int without overflow---provided you write it
correctly:

	return ((int) (unsigned int) *(unsigned char *) --s1)
	     - ((int) (unsigned int) *(unsigned char *) s2);
	
This works, right?

---Dan

ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (06/21/91)

In article <677424380@romeo.cs.duke.edu>, drh@duke.cs.duke.edu (D. Richard Hipp) writes:
> I have, on various occasions, implemented my own string comparison
> routines which attempt to address the above deficiencies in strcmp.
> (One such implementation, strpbcmp -- string compare in PhoneBook order,
> is attached.)

The routine posted does NOT compare strings in Phone Book order.
Here are the rules from a Phone Book:
    Names are divided into two parts for sorting.
    The first part, or the first word, determines the place to
    find the name.  The second part, all the initials or
    remaining words (including locality and telephone number)
    determine the order within that group.

    Business names which begin with "the" are generally sorted
    under the next word.

    Punctiation and special characters within a name will generally
    not alter their alphabetical position and should be ignored.

    When initials precede a name, they will be treated as the first
    name, regardless of punctuation.

    If the name contains a number, the numeric character will be
    sorted as though it were a word (i.e. 1 = one).  In some cases,
    names which commence with numerals will be found under the name
    as it is pronounced.

    A prefix is included as part of the first word even if it is
    separated from the second part of the name by a hyphen.
    (This one is _really_ fun.  You have to know that "Le Blanc"
    has a prefix "Le" while "Le Tseung" probably hasn't, so that
    the latter name precedes the first.)

    Names which contain a hyphen are treated as two words and are
    sorted according to the first name.  This does not apply to
    hyphenated names which begin with a prefix.

    "Mc" is treated as though spelt "Mac".  Names such as "Mace"
    and "Mack" are sorted with those names which commence with
    "Mc" and "Mac".
    "Mt" is treated as though spelt "Mount"  Names such as "Mount"
    appear first, followed by names which have "Mt" or "Mount" as
    the first part of their name.
    Names beginning with "St" are treated as though beginning
    with "Saint" (same rules as Mt/Mount).

This isn't really adequate; McDonald may also be spelled M'Donald,
and "St" is sometimes abbreviated to S, so "S. Adam Parish School"
should be sorted with "Saint-Adam", but isn't.

It is worth noting that 'phone book order is not the same as dictionary
order.  There really wasn't any one order that C could have used.

> I therefore request option from the net on what others think is the
> one right, true, and proper way to compare strings.

There isn't any.  You might like to imitate the approach in ANSI C.
There are two functions which give you access to the local collating
method (see setlocale() / LC_COLLATE).  There is a function strxfrm():
	strxfrm(dest, source, /*length? I forget*/)
produces in dest a ``normalised'' copy of source, and returns the
length of this copy.  Comparing two normalised copies using strcmp()
then does the right thing.
	strcoll(s1, s2)
has the same effect as normlising s1 and s2 separately, then comparing
them with strcmp.  What you want to do is to provide any number of
normalising functions that take your fancy, and use strcmp() to
compare normalised results.  If you do it this way, then you can also
use your comparison method with an external sort:  when you write the
file to be sorted, put the normalised version first, then a mark, then
the real data.  Sort (letting the external sort use the same rule as
strcmp), then strip off the normalised prefixes.

Note:  when you are sorting, you want the very fastest comparison you
can get.  Sorting a bunch of names by normalising them, then sorting
the normalised versions using strcmp(), is going to be a *LOT* faster
than sorting using your strpbcmp or anything like it.
-- 
I agree with Jim Giles about many of the deficiencies of present UNIX.

cjc@ulysses.att.com (Chris Calabrese) (06/22/91)

In article <6476.Jun2023.30.3491@kramden.acf.nyu.edu>, brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
|> On the other hand, how can sizeof(char) == sizeof(int)? I was under the
|> impression that sizeof(char) had to be smaller than sizeof(int).
|> Otherwise the getchar() interface explodes. If sizeof(char) <
|> sizeof(int) and both are some number of bits, then the unsigned char
|> subtraction will fit into int without overflow---provided you write it
|> correctly:
|> 
|> 	return ((int) (unsigned int) *(unsigned char *) --s1)
|> 	     - ((int) (unsigned int) *(unsigned char *) s2);
|> 	
|> This works, right?
|> 
|> ---Dan

ANSI makes the assertions that:
	CHAR_BIT (num of bits in a char) >= 8
	sizeof(char) <= sizeof(short) <= sizeof(ing) <= sizeof(long)
	a short is at least 16 bits
	a long is at least 32 bits

Strictly speaking, there's no guarantee that sizeof(char) < sizeof(long),
though K&R 2 does mention stuff about char's being the right size for
holding 1 character of the local character so it certainly could be
16 bits on a machine which has an oriental character set.
Conceivably, int could also be the same size.

-- 
Name:			Christopher J. Calabrese
Brain loaned to:	AT&T Bell Laboratories, Murray Hill, NJ
att!ulysses!cjc		cjc@ulysses.att.com
Obligatory Quote:	``pher - gr. vb. to schlep.  phospher - to schlep light.philosopher - to schlep thoughts.''

volpe@camelback.crd.ge.com (Christopher R Volpe) (06/22/91)

In article <6476.Jun2023.30.3491@kramden.acf.nyu.edu>,
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
|>On the other hand, how can sizeof(char) == sizeof(int)? I was under the
|>impression that sizeof(char) had to be smaller than sizeof(int).
|>Otherwise the getchar() interface explodes. 

I know of at least one architecture where sizeof(char)==sizeof(int) == 32 bits.
(Although the only compiler for it that I'm aware of is freestanding.)
All that's required is that sizeof(char) be no bigger than sizeof(int).
This won't break the getchar() interface, I believe, because getchar() doesn't
necessarily have to return all values of type char. It has to return all 
characters in the machine's basic character set, plus EOF. So on this machine
the bit patterns for char values -1 through 255 would all be distinct.

-Chris

|>
|>---Dan
                             
==================
Chris Volpe
G.E. Corporate R&D
volpecr@crd.ge.com

mouse@thunder.mcrcim.mcgill.edu (der Mouse) (06/22/91)

In article <1991Jun18.074029.12226@panix.uucp>, yanek@panix.uucp (Yanek Martinson) writes:
> For intcmp(a,b) why not just use a-b ?

In a word, overflow.

> That will tell you if they are equal and if not, which is greater.

Only when a-b doesn't overflow.  On a machine with 16-bit
two's-complement ints, consider a = -32000 and b = 32000.  Then a < b
(I trust nobody disagrees with this :-), so we want a negative result.
Unfortunately a-b overflows.  If it actually produces a result (as
opposed to a crash at run-time), the most likely thing to get, I would
say, would be 1536, which is not exactly negative.

> For strcmp(s,t): while(*s++==*t++&&*s&&*t); return *s-*t;

I think this has been sufficiently dissected already, so I'll skip it.

					der Mouse

			old: mcgill-vision!mouse
			new: mouse@larry.mcrcim.mcgill.edu

henry@zoo.toronto.edu (Henry Spencer) (06/26/91)

In article <677424380@romeo.cs.duke.edu> drh@duke.cs.duke.edu (D. Richard Hipp) writes:
>... I need a strcmp that used in conjuction with sort will put
>strings in what I would call "correct" alphabetical order...

As others have observed, you're going to have to write your own, because
there is no universal definition of "correct" alphabetical order.  (For
example, phonebook order != library card-catalog order != dictionary order.)
-- 
"We're thinking about upgrading from    | Henry Spencer @ U of Toronto Zoology
SunOS 4.1.1 to SunOS 3.5."              |  henry@zoo.toronto.edu  utzoo!henry

mouse@thunder.mcrcim.mcgill.edu (der Mouse) (06/26/91)

In article <67790004@hpcupt1.cup.hp.com>, thomasw@hpcupt1.cup.hp.com (Thomas Wang) writes:

> #define COMP_GREATER   (1)
> #define COMP_EQUAL     (0)
> #define COMP_LESS      (-1)
> #define COMP_UNORDERED (MININT)

> #define LE(x) (-(x) >= COMP_EQUAL) /* less than or equal, -MININT == MININT */

-MININT is not necessarily equal to MININT.  Suppose integers are
sign-magnitude, or ones-complement, or balanced-ternary[%]....

[%] This is legal, though painful ugliness is mandated to preserve the
    appearance of binary for numbers from 0 to INT_MAX.

					der Mouse

			old: mcgill-vision!mouse
			new: mouse@larry.mcrcim.mcgill.edu

henry@zoo.toronto.edu (Henry Spencer) (06/27/91)

I wrote:
>...phonebook order != library card-catalog order != dictionary order...

To cap it off, a friend of mine points out that there is more than one set
of rules for phonebook order, and I wouldn't be surprised if the same is
true of the other two examples I cited...
-- 
"We're thinking about upgrading from    | Henry Spencer @ U of Toronto Zoology
SunOS 4.1.1 to SunOS 3.5."              |  henry@zoo.toronto.edu  utzoo!henry