jk@cs.man.ac.uk (John Kewley ICL) (06/14/91)
Are there any examples of source code for strcmp, I need to implement code for a large variety of types in the style of strcmp and would like to see what has been done before. I know this seems like a simple school exercise, but I would like to see various implementations. e.g. intcmp : ((a>b)+(a>=b))-1 or (a>=b>?(a>b):-1 How much quicker is the second one likely to be? It also seems "nicer" as well, adding two boolean values is not very "nice". -- J.K. John M. Kewley, ICL, Wenlock Way, West Gorton, Manchester. M12 5DR Tel: (+44) 61 223 1301 X2138 Email: jk@cs.man.ac.uk / jk@nw.stl.stc.co.uk
yanek@panix.uucp (Yanek Martinson) (06/18/91)
For intcmp(a,b) why not just use a-b ? That will tell you if they are equal and if not, which is greater. For strcmp(s,t): while(*s++==*t++&&*s&&*t); return *s-*t;
henry@zoo.toronto.edu (Henry Spencer) (06/18/91)
In article <1991Jun18.074029.12226@panix.uucp> yanek@panix.uucp (Yanek Martinson) writes: >For intcmp(a,b) why not just use a-b ? That will tell you if they are equal >and if not, which is greater. If the subtraction doesn't overflow, that is. (The effects of overflow are undefined, and strange things can happen even with old compilers.) >For strcmp(s,t): while(*s++==*t++&&*s&&*t); return *s-*t; Apart from being inefficient, this code is wrong, since the final comparison must be done as if characters are unsigned. (This is not precisely the rule I would have picked, but at least ANSI C did pick something specific...) That's aside from the fact that the final comparison is being done on the two characters *after* the mismatch rather than on the mismatch itself... -- "We're thinking about upgrading from | Henry Spencer @ U of Toronto Zoology SunOS 4.1.1 to SunOS 3.5." | henry@zoo.toronto.edu utzoo!henry
torek@elf.ee.lbl.gov (Chris Torek) (06/19/91)
>In article <1991Jun18.074029.12226@panix.uucp> >yanek@panix.uucp (Yanek Martinson) writes: >>For strcmp(s,t): while(*s++==*t++&&*s&&*t); return *s-*t; In article <1991Jun18.153653.1494@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes: >Apart from being inefficient, this code is wrong, since the final comparison >must be done as if characters are unsigned. ... That's aside from the fact >that the final comparison is being done on the two characters *after* the >mismatch rather than on the mismatch itself... ... which may not even exist (if the loop stopped on '\0'). The usual way to write strcmp, if one is not interested in working hard ---though one should be; a fast strcmp improves one's Dhrystone benchmarks, which as we all know is more important than speeding up real programs :-) ---is this: int strcmp(char *s1, char *s2) { /* * Loop until different, but stop if both are '\0'. */ while (*s1++ == *s2) if (*s2++ == '\0') return (0); /* * *--s1 and *s2 differ. One might be '\0'. */ return (*(unsigned char *)--s1 - *(unsigned char *)s2); } although this assumes that the subtraction will not overflow. -- In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427) Berkeley, CA Domain: torek@ee.lbl.gov
thomasw@hpcupt1.cup.hp.com (Thomas Wang) (06/19/91)
> Are there any examples of source code for strcmp, I need to implement > code for a large variety of types in the style of strcmp and > would like to see what has been done before. > I know this seems like a simple school exercise, but I would like to see > various implementations. In my opinion, the +1, 0, and -1 comparisons are flawed, because they do not take account unordered comparisons. So I usually define 6 macros. #define COMP_GREATER (1) #define COMP_EQUAL (0) #define COMP_LESS (-1) #define COMP_UNORDERED (MININT) #define GE(x) ((x) >= COMP_EQUAL) /* greater than or equal */ #define LE(x) (-(x) >= COMP_EQUAL) /* less than or equal, -MININT == MININT */ This system is compatible with strcmp(), yet able to deal with unordered comparison. For example, compare 1.0 against NaNs should return COMP_UNORDERED. int32 intcmp(int a, int b) { return((a == b) ? COMP_EQUAL : ((a > b) ? COMP_GREATER : COMP_LESS)); } -Thomas Wang (Everything is an object.) wang@hpdmsjlm.cup.hp.com thomasw@hpcupt1.cup.hp.com
ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (06/19/91)
In article <1991Jun18.074029.12226@panix.uucp>, yanek@panix.uucp (Yanek Martinson) writes: > For strcmp(s,t): while(*s++==*t++&&*s&&*t); return *s-*t; Sorry, there's a mistake. Let p = "\0followed by 50 million bytes of junk"; Then strcmp(p,p) will set s=p, t=p, find *s == *t == '\0', increment s and t, and because the _next_ bytes aren't \0 it will keep going. Not good. In fact, if you call strcmp("\0a", "\0b") with that definition, it will claim they _aren't_ equal, but they are both empty! Keep it simple: while (*s == *t && *s != '\0' && *t != '\0') s++, t++; return *s - *t; Much as I hate to say this, a good compiler is likely to do just as well from this as from anything cleverer. -- Q: What should I know about quicksort? A: That it is *slow*. Q: When should I use it? A: When you have only 256 words of main storage.
torek@elf.ee.lbl.gov (Chris Torek) (06/20/91)
In article <14421@dog.ee.lbl.gov> I wrote: > int strcmp(char *s1, char *s2) { This should be int strcmp(const char *s1, const char *s2) { Thanks to Matthew Farwell <dylan@ibmpcug.co.uk> for pointing this out. > return (*(unsigned char *)--s1 - *(unsigned char *)s2); >... this assumes that the subtraction will not overflow. The word `overflow' is wrong. The subtraction cannot overflow. There are several cases: 1. K&R C: `unsigned char's widen to `unsigned int's. Then the subtraction is done in unsigned arithmetic, which is modular arithmetic in 2^(number of bits). 2. ANSI C: 2a: sizeof(char) < sizeof(int). Then `unsigned char's widen to signed `int's that are nonetheless nonnegative. If the underlying system is 2's complement, 1's complement, or sign-magnitude, the subtraction will not overflow. (If it is something else, I am not sure what happens. With any luck, it does not overflow.) 2b: sizeof(char) == sizeof(int). Then `unsigned char's widen to `unsigned int's, and the subtraction is done in unsigned arithmetic, just as in K&R C. (Note the complications in 2a, due to the so-called `value preserving' rules, which are Wrong, but are now Engraved in Stone. Oh well.) The complication I was onsidering in <14421@dog.ee.lbl.gov> occurs in case 2b. Suppose, for instance, that char and int are both 16 bits, and that we have two strings made up of characters (32761,0) and (1,0) respectively. Then the comparison will return (int)((unsigned char)32761 - (unsigned char)1) (the `int' cast is provided by the `return'). This will be equal to (int)((unsigned)32761 - (unsigned)1) or (int)((unsigned)32760) or, in 2's complement, -8. A return value of -8 says that s1 < s2, yet 32761 > 1. This is what I called `overflow' above. I am not sure *what* to call it, but `overflow' is wrong. Thanks to Lasse H. Ostergaard <lhoe@id.dth.dk> for noting this. -- In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427) Berkeley, CA Domain: torek@ee.lbl.gov
pdsmith@bbn.com (Peter D. Smith) (06/20/91)
In article <1991Jun18.074029.12226@panix.uucp> yanek@panix.uucp (Yanek Martinson) writes: >For intcmp(a,b) why not just use a-b ? That will tell you if they are equal >and if not, which is greater. > >For strcmp(s,t): while(*s++==*t++&&*s&&*t); return *s-*t; a-b will NOT tell you which is larger. Instead it will overflow and either return the wrong answer or call the fatal-error trap and abruptly end the experiment. Peter D. Smith
drh@duke.cs.duke.edu (D. Richard Hipp) (06/20/91)
A recent thread on this newsgroup has devoted a lot of bandwidth to discussions of how to implement a fast "strcmp". I suppose doing strcmp quickly is necessary on occasions (such as running benchmarks :-)), but more often than a quick strcmp, I need a *sensible* strcmp. That is to say, I need a strcmp that used in conjuction with sort will put strings in what I would call "correct" alphabetical order. Strcmp fails at this task for two principle reasons: 1) It places "BBB" before "aaa". 2) It places "x10" before "x9". Both of these decisions are wrong, IMHO. I have, on various occasions, implemented my own string comparison routines which attempt to address the above deficiencies in strcmp. (One such implementation, strpbcmp -- string compare in PhoneBook order, is attached.) I am not especially pleased with any of these routines, however. Somehow, to my mind, they like that special sense of elegance and grace that a truly great string comparison routine should have. I therefore request option from the net on what others think is the one right, true, and proper way to compare strings. Side issues that you may wish to comment upon are 1) what do you call this great new string comparison function? ("strcmp" is already taken) and 2) how might you implement it efficiently? Responses by email to drh@cs.duke.edu will be summarized to this newsgroup after an appropriate time interval. Appendix: a candidate string comparison function. ------------------------------ cut here ---------------------------------- #include <ctype.h> /* ** Compare strings in phonebook order */ int strpbcmp(a,b) char *a,*b; { register int ca, cb; int ai, bi, cnt = 0; int bias = 0; #ifdef TRACE printf("Comparing \"%s\" to \"%s\" yields ",a,b); #endif ca = *(a++); cb = *(b++); while( ca && cb ){ if( bias==0 ){ if( isupper(ca) ){ ca = tolower(ca); bias--; } if( isupper(cb) ){ cb = tolower(cb); bias++; } }else{ if( isupper(ca) ){ ca = tolower(ca); } if( isupper(cb) ){ cb = tolower(cb); } } if( isdigit(ca) ){ if( cnt-->0 ){ if( cb!=ca ) break; }else{ if( !isdigit(cb) ) break; for(ai=0; isdigit(a[ai]); ai++); for(bi=0; isdigit(b[bi]); bi++); if( ai<bi ){ ca=0; break; } if( bi<ai ){ cb=0; break; } if( ca!=cb ) break; cnt = ai; } }else if( ca!=cb ){ break; } ca = *(a++); cb = *(b++); } if( ca==cb ) ca += bias; #ifdef TRACE printf("%d\n",ca-cb); #endif return ca-cb; } #ifdef TEST #define LINESIZE 40 #define LINECNT 2000 #include <stdio.h> void main(void){ char x[LINECNT][LINESIZE]; int i,j; for(i=0; i<LINECNT && fgets(x[i],LINESIZE,stdin); i++); qsort(x,i,sizeof(char[LINESIZE]),strpbcmp); for(j=0; j<i; j++) fputs(x[j],stdout); } #endif
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/21/91)
In article <14498@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes: > > return (*(unsigned char *)--s1 - *(unsigned char *)s2); > >... this assumes that the subtraction will not overflow. > The word `overflow' is wrong. The subtraction cannot overflow. There > are several cases: [ discussion ] > Suppose, for instance, that char and int are both 16 > bits, and that we have two strings made up of characters (32761,0) > and (1,0) respectively. Wait a minute. First of all, if unsigned chars range from 0 to 65535, then subtracting two unsigned char values will give a range of -65535 to -1 for s1 < s2, 0 for s1 == s2, and 1 to 65535 for s1 > s2. There's no way these 131071 values can fit into a 16-bit int without further processing. On the other hand, how can sizeof(char) == sizeof(int)? I was under the impression that sizeof(char) had to be smaller than sizeof(int). Otherwise the getchar() interface explodes. If sizeof(char) < sizeof(int) and both are some number of bits, then the unsigned char subtraction will fit into int without overflow---provided you write it correctly: return ((int) (unsigned int) *(unsigned char *) --s1) - ((int) (unsigned int) *(unsigned char *) s2); This works, right? ---Dan
ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (06/21/91)
In article <677424380@romeo.cs.duke.edu>, drh@duke.cs.duke.edu (D. Richard Hipp) writes: > I have, on various occasions, implemented my own string comparison > routines which attempt to address the above deficiencies in strcmp. > (One such implementation, strpbcmp -- string compare in PhoneBook order, > is attached.) The routine posted does NOT compare strings in Phone Book order. Here are the rules from a Phone Book: Names are divided into two parts for sorting. The first part, or the first word, determines the place to find the name. The second part, all the initials or remaining words (including locality and telephone number) determine the order within that group. Business names which begin with "the" are generally sorted under the next word. Punctiation and special characters within a name will generally not alter their alphabetical position and should be ignored. When initials precede a name, they will be treated as the first name, regardless of punctuation. If the name contains a number, the numeric character will be sorted as though it were a word (i.e. 1 = one). In some cases, names which commence with numerals will be found under the name as it is pronounced. A prefix is included as part of the first word even if it is separated from the second part of the name by a hyphen. (This one is _really_ fun. You have to know that "Le Blanc" has a prefix "Le" while "Le Tseung" probably hasn't, so that the latter name precedes the first.) Names which contain a hyphen are treated as two words and are sorted according to the first name. This does not apply to hyphenated names which begin with a prefix. "Mc" is treated as though spelt "Mac". Names such as "Mace" and "Mack" are sorted with those names which commence with "Mc" and "Mac". "Mt" is treated as though spelt "Mount" Names such as "Mount" appear first, followed by names which have "Mt" or "Mount" as the first part of their name. Names beginning with "St" are treated as though beginning with "Saint" (same rules as Mt/Mount). This isn't really adequate; McDonald may also be spelled M'Donald, and "St" is sometimes abbreviated to S, so "S. Adam Parish School" should be sorted with "Saint-Adam", but isn't. It is worth noting that 'phone book order is not the same as dictionary order. There really wasn't any one order that C could have used. > I therefore request option from the net on what others think is the > one right, true, and proper way to compare strings. There isn't any. You might like to imitate the approach in ANSI C. There are two functions which give you access to the local collating method (see setlocale() / LC_COLLATE). There is a function strxfrm(): strxfrm(dest, source, /*length? I forget*/) produces in dest a ``normalised'' copy of source, and returns the length of this copy. Comparing two normalised copies using strcmp() then does the right thing. strcoll(s1, s2) has the same effect as normlising s1 and s2 separately, then comparing them with strcmp. What you want to do is to provide any number of normalising functions that take your fancy, and use strcmp() to compare normalised results. If you do it this way, then you can also use your comparison method with an external sort: when you write the file to be sorted, put the normalised version first, then a mark, then the real data. Sort (letting the external sort use the same rule as strcmp), then strip off the normalised prefixes. Note: when you are sorting, you want the very fastest comparison you can get. Sorting a bunch of names by normalising them, then sorting the normalised versions using strcmp(), is going to be a *LOT* faster than sorting using your strpbcmp or anything like it. -- I agree with Jim Giles about many of the deficiencies of present UNIX.
cjc@ulysses.att.com (Chris Calabrese) (06/22/91)
In article <6476.Jun2023.30.3491@kramden.acf.nyu.edu>, brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes: |> On the other hand, how can sizeof(char) == sizeof(int)? I was under the |> impression that sizeof(char) had to be smaller than sizeof(int). |> Otherwise the getchar() interface explodes. If sizeof(char) < |> sizeof(int) and both are some number of bits, then the unsigned char |> subtraction will fit into int without overflow---provided you write it |> correctly: |> |> return ((int) (unsigned int) *(unsigned char *) --s1) |> - ((int) (unsigned int) *(unsigned char *) s2); |> |> This works, right? |> |> ---Dan ANSI makes the assertions that: CHAR_BIT (num of bits in a char) >= 8 sizeof(char) <= sizeof(short) <= sizeof(ing) <= sizeof(long) a short is at least 16 bits a long is at least 32 bits Strictly speaking, there's no guarantee that sizeof(char) < sizeof(long), though K&R 2 does mention stuff about char's being the right size for holding 1 character of the local character so it certainly could be 16 bits on a machine which has an oriental character set. Conceivably, int could also be the same size. -- Name: Christopher J. Calabrese Brain loaned to: AT&T Bell Laboratories, Murray Hill, NJ att!ulysses!cjc cjc@ulysses.att.com Obligatory Quote: ``pher - gr. vb. to schlep. phospher - to schlep light.philosopher - to schlep thoughts.''
volpe@camelback.crd.ge.com (Christopher R Volpe) (06/22/91)
In article <6476.Jun2023.30.3491@kramden.acf.nyu.edu>, brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes: |>On the other hand, how can sizeof(char) == sizeof(int)? I was under the |>impression that sizeof(char) had to be smaller than sizeof(int). |>Otherwise the getchar() interface explodes. I know of at least one architecture where sizeof(char)==sizeof(int) == 32 bits. (Although the only compiler for it that I'm aware of is freestanding.) All that's required is that sizeof(char) be no bigger than sizeof(int). This won't break the getchar() interface, I believe, because getchar() doesn't necessarily have to return all values of type char. It has to return all characters in the machine's basic character set, plus EOF. So on this machine the bit patterns for char values -1 through 255 would all be distinct. -Chris |> |>---Dan ================== Chris Volpe G.E. Corporate R&D volpecr@crd.ge.com
mouse@thunder.mcrcim.mcgill.edu (der Mouse) (06/22/91)
In article <1991Jun18.074029.12226@panix.uucp>, yanek@panix.uucp (Yanek Martinson) writes: > For intcmp(a,b) why not just use a-b ? In a word, overflow. > That will tell you if they are equal and if not, which is greater. Only when a-b doesn't overflow. On a machine with 16-bit two's-complement ints, consider a = -32000 and b = 32000. Then a < b (I trust nobody disagrees with this :-), so we want a negative result. Unfortunately a-b overflows. If it actually produces a result (as opposed to a crash at run-time), the most likely thing to get, I would say, would be 1536, which is not exactly negative. > For strcmp(s,t): while(*s++==*t++&&*s&&*t); return *s-*t; I think this has been sufficiently dissected already, so I'll skip it. der Mouse old: mcgill-vision!mouse new: mouse@larry.mcrcim.mcgill.edu
henry@zoo.toronto.edu (Henry Spencer) (06/26/91)
In article <677424380@romeo.cs.duke.edu> drh@duke.cs.duke.edu (D. Richard Hipp) writes: >... I need a strcmp that used in conjuction with sort will put >strings in what I would call "correct" alphabetical order... As others have observed, you're going to have to write your own, because there is no universal definition of "correct" alphabetical order. (For example, phonebook order != library card-catalog order != dictionary order.) -- "We're thinking about upgrading from | Henry Spencer @ U of Toronto Zoology SunOS 4.1.1 to SunOS 3.5." | henry@zoo.toronto.edu utzoo!henry
mouse@thunder.mcrcim.mcgill.edu (der Mouse) (06/26/91)
In article <67790004@hpcupt1.cup.hp.com>, thomasw@hpcupt1.cup.hp.com (Thomas Wang) writes: > #define COMP_GREATER (1) > #define COMP_EQUAL (0) > #define COMP_LESS (-1) > #define COMP_UNORDERED (MININT) > #define LE(x) (-(x) >= COMP_EQUAL) /* less than or equal, -MININT == MININT */ -MININT is not necessarily equal to MININT. Suppose integers are sign-magnitude, or ones-complement, or balanced-ternary[%].... [%] This is legal, though painful ugliness is mandated to preserve the appearance of binary for numbers from 0 to INT_MAX. der Mouse old: mcgill-vision!mouse new: mouse@larry.mcrcim.mcgill.edu
henry@zoo.toronto.edu (Henry Spencer) (06/27/91)
I wrote:
>...phonebook order != library card-catalog order != dictionary order...
To cap it off, a friend of mine points out that there is more than one set
of rules for phonebook order, and I wouldn't be surprised if the same is
true of the other two examples I cited...
--
"We're thinking about upgrading from | Henry Spencer @ U of Toronto Zoology
SunOS 4.1.1 to SunOS 3.5." | henry@zoo.toronto.edu utzoo!henry