dick@slvblc.UUCP (Dick Flanagan) (02/14/88)
In article <2118@bsu-cs.UUCP> dhesi@bsu-cs.UUCP (Rahul Dhesi) writes:
>In article <241@oracle.UUCP> rbradbur@oracle.UUCP (Robert Bradbury) writes:
>>On another note; does everyone realize that the current standard allows
>>the results of the str/memcmp() function to be implementation defined
>>if the characters being compared have the high-bit set?
>
>You mean that I can have two identical strings with high bits set, and
>strcmp() could return something other than 0?

No.  Equal is equal.

>Or does the problem lie only in deciding lexical order?

The problem lies in that the developer of the run-time routines is free
to decide that strcmp() is comparing *signed* eight-bit numbers, so that
any character with the high-bit set is considered to be *lower* than any
character with the high-bit off.  This would mean that the non-equal
returns could differ between one compiler and another.

>While we're on the subject, just what is the meaning of "implementation-
>defined"?

"Left to the discretion of the compiler run-time routine developer."

Dick

--
Dick Flanagan, W6OLD                         GEnie: FLANAGAN
UUCP: ...!ucbvax!ucscc!slvblc!dick           Voice: +1 408 336 3481
INTERNET: slvblc!dick@ucscc.UCSC.EDU         LORAN: N037 05.5 W122 05.2
USPO: PO Box 155, Ben Lomond, CA 95005
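The two readings of "which character is lower" can be sketched in C.
This is only an illustration, assuming 8-bit chars on a two's-complement
machine; cmp_as_signed and cmp_as_unsigned are hypothetical names, not
routines from any actual run-time library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: advance to the first differing position (or the end
 * of s1) and return the difference of the two chars read as SIGNED
 * values.  Converting an out-of-range value to signed char is itself
 * implementation-defined; on the usual two's-complement machine the
 * byte 0200 becomes -128. */
int cmp_as_signed(const char *s1, const char *s2)
{
    assert(s1 != NULL && s2 != NULL);
    while (*s1 != '\0' && *s1 == *s2) {
        s1++;
        s2++;
    }
    return (int)(signed char)*s1 - (int)(signed char)*s2;
}

/* Hypothetical: same walk, but the differing chars are read as
 * UNSIGNED values, so the byte 0200 counts as 128. */
int cmp_as_unsigned(const char *s1, const char *s2)
{
    assert(s1 != NULL && s2 != NULL);
    while (*s1 != '\0' && *s1 == *s2) {
        s1++;
        s2++;
    }
    return (int)(unsigned char)*s1 - (int)(unsigned char)*s2;
}
```

Under those assumptions cmp_as_signed("\200", "a") is negative and
cmp_as_unsigned("\200", "a") is positive, while both return 0 for
identical strings, which is the "equal is equal" part.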
henry@utzoo.uucp (Henry Spencer) (02/16/88)
> The problem lies in that the developer of the run-time routines is free
> to decide that strcmp() is comparing *signed* eight-bit numbers...

The situation actually gets worse.  Consider strcmp("a\200", "a").  Is
its value positive or negative?  The orthodox rule of lexical ordering
says it should be positive, because strlen("a\200") > strlen("a") and
strncmp("a\200", "a", strlen("a")) == 0.  That is, the '\0' that
terminates the string should not participate in comparisons, and it is
irrelevant whether '\200' < '\0' on a signed-char machine.  Existing
implementations often get this wrong.  The X3J11 draft appears to permit
this.  (The wording is not quite specific enough for me to be certain.)
--
Those who do not understand Unix are |  Henry Spencer @ U of Toronto Zoology
condemned to reinvent it, poorly.    | {allegra,ihnp4,decvax,utai}!utzoo!henry
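A minimal sketch of a comparison that follows the orthodox rule, with
the terminator checked before any character values are compared.  The
name lexical_cmp is hypothetical, and the ordering of two differing
non-terminator characters is deliberately left to plain char, so only
the prefix case is pinned down here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: a proper prefix always sorts first, whatever the
 * signedness of plain char.  The '\0' is tested explicitly and never
 * takes part in a less-than comparison. */
int lexical_cmp(const char *s1, const char *s2)
{
    assert(s1 != NULL && s2 != NULL);
    for (;; s1++, s2++) {
        if (*s1 == '\0')
            return (*s2 == '\0') ? 0 : -1;  /* equal, or s1 is a prefix */
        if (*s2 == '\0')
            return 1;                       /* s2 is a proper prefix */
        if (*s1 != *s2)
            return (*s1 < *s2) ? -1 : 1;    /* plain char ordering */
    }
}
```

With this, lexical_cmp("a\200", "a") is positive on signed-char and
unsigned-char machines alike, because the comparison never asks whether
'\200' is less than '\0'.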
wnp@dcs.UUCP (Wolf N. Paul) (02/17/88)
In article <2118@bsu-cs.UUCP> dhesi@bsu-cs.UUCP (Rahul Dhesi) writes:
>In article <241@oracle.UUCP> rbradbur@oracle.UUCP (Robert Bradbury) writes:
>>On another note; does everyone realize that the current standard allows
>>the results of the str/memcmp() function to be implementation defined
>>if the characters being compared have the high-bit set?
>
>You mean that I can have two identical strings with high bits set, and
>strcmp() could return something other than 0?  This would be truly a
>serious flaw in the standard.

The purpose of this would be to allow the use of the "alternate" character
set (= codes > 127) to be used for international language applications.
Languages which have more than 26 alpha characters need the upper half
of the eight-bit code range to implement their languages, and in that
case ignoring the 8th bit would be very counter-productive.

>Or does the problem lie only in deciding lexical order?  That's not so
>bad--I wouldn't trust the standard library in that case anyway, since
>we all have our own opinions about what lexical order is correct.

Similarly, we all have our own opinions about how many unique characters
should be available -- limiting the number to 128 is more restrictive than
necessary, so I think the standard is appropriate.

>While we're on the subject, just what is the meaning of "implementation-
>defined"?  E.g., "Oh, by the way, I would avoid comparing strings
>with high bits set, since the result is implementation-defined.  On
>this particular system the result is that all disk packs are erased and
>the system halts."

Hardly.  Compiler implementors do want to sell more than a few copies
of their product, and such behaviour would not be conducive to that goal :-)

>--
>Rahul Dhesi         UUCP:  <backbones>!{iuvax,pur-ee,uunet}!bsu-cs!dhesi

----
Wolf N. Paul            Phone: (214) 306-9101 (h)   (214) 404-8077 (w)
3387 Sam Rayburn Run    UUCP:     ihnp4!killer!{dcs, doulos}!wnp
Carrollton, TX 75007    INTERNET: wnp@dcs.UUCP       ESL: 62832882
Pat Robertson does NOT speak for all evangelical Christians--not for me, anyway!
pardo@june.cs.washington.edu (David Keppel) (02/18/88)
[ Define implementation-defined ]
> >with high bits set, since the result is implementation-defined.  On
> >this particular system [that] result is that all disk packs are erased and
> >the system halts."
>
>Hardly.  Compiler implementors do want to sell more than a few copies
>of their product, and such behaviour would not be conducive to that goal :-)

:-(
See the recent discussion about (a) no 8086 protection and disk-drive
tables borked by erring large-model programs (a historical accident) and
(b) the Microsoft "optimize by breaking" 5.0 compiler.

    ;-D on  (That's why I'll use a pencil instead of OS/2)  Pardo
cjc@ulysses.homer.nj.att.com (Chris Calabrese[rs]) (02/18/88)
In article <16@dcs.UUCP>, wnp@dcs.UUCP writes:
> In article <2118@bsu-cs.UUCP> dhesi@bsu-cs.UUCP (Rahul Dhesi) writes:
> >In article <241@oracle.UUCP> rbradbur@oracle.UUCP (Robert Bradbury) writes:
> >>On another note; does everyone realize that the current standard allows
> >>the results of the str/memcmp() function to be implementation defined
> >>if the characters being compared have the high-bit set?
>
> The purpose of this would be to allow the use of the "alternate" character
> set (= codes > 127) to be used for international language applications.
> Languages which have more than 26 alpha characters need the upper half
> of the eight-bit code range to implement their languages, and in that
> case ignoring the 8th bit would be very counter-productive.

If ansi wants this to really work, they'll have to allow for
16 bit char's, the standard in Japanese and Chinese language
word processors.  There is still a problem with
using the 8th bit, as many machines generate strict parity
for character work.

Presumably, the lexical ordering problem can be eliminated by stripping
the 8th bit before comparison, or better yet, 15 bit char's with
1 bit parity, or any other combo.

	Chris Calabrese
	AT&T Bell Labs
	ulysses!cjc
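What "stripping the 8th bit before comparison" would do can be sketched
as follows.  This is an illustration only, assuming 8-bit chars and an
ASCII-based code set; cmp_strip_high_bit is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: compare two strings after masking every byte down to
 * its low seven bits.  The sign of the result no longer depends on
 * char signedness, but bytes that differ only in the 8th bit become
 * indistinguishable, and a byte of 0200 masks to 0 and is treated
 * like the terminator. */
int cmp_strip_high_bit(const char *s1, const char *s2)
{
    assert(s1 != NULL && s2 != NULL);
    for (;; s1++, s2++) {
        int c1 = (unsigned char)*s1 & 0x7F;
        int c2 = (unsigned char)*s2 & 0x7F;
        if (c1 != c2)
            return c1 - c2;
        if (c1 == 0)
            return 0;   /* both masked bytes are 0: stop here */
    }
}
```

Under the ASCII assumption the byte 0341 masks to 0141, which is 'a', so
cmp_strip_high_bit("\341", "a") returns 0.  The ordering becomes
deterministic at the price of treating an upper-half character and its
7-bit counterpart as the same character.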
gwyn@brl-smoke.ARPA (Doug Gwyn ) (02/19/88)
In article <10095@ulysses.homer.nj.att.com> cjc@ulysses.homer.nj.att.com (Chris Calabrese[rs]) writes:
>In article <16@dcs.UUCP>, wnp@dcs.UUCP writes:
>> In article <2118@bsu-cs.UUCP> dhesi@bsu-cs.UUCP (Rahul Dhesi) writes:
>> >In article <241@oracle.UUCP> rbradbur@oracle.UUCP (Robert Bradbury) writes:
>> >>On another note; does everyone realize that the current standard allows
>> >>the results of the str/memcmp() function to be implementation defined
>> >>if the characters being compared have the high-bit set?
>> The purpose of this would be to allow the use of the "alternate" character
>> set (= codes > 127) to be used for international language applications.
>>...
>If ansi wants this to really work, they'll have to allow for
>16 bit char's, the standard in Japanese and Chinese language
>word processors.  There is still a problem with
>using the 8th bit, as many machines generate strict parity
>for character work.

This discussion has gone onto completely the wrong track.  The reason
for allowing the indeterminacy in strcmp()'s return sign when the
differing characters have the high bit set is simply because that is
the way C "plain" chars are, so that is in fact how existing
implementations behave.

The C source characters are required to appear positive, although other
additional characters in an implementation can appear negative.  This
means that an 8-bit EBCDIC implementation would have to make "plain"
chars act like unsigned chars, for example.

The proposed ANSI C provides adequate (but minimal) support for
"multi-byte characters" such as are used in Japan.  Note that this is
not the same as 16-bit chars, which are permitted but would not usually
be the implementor's choice for those environments.  (Even though it is
conceptually and practically much cleaner than explicit multi-byte
sequences, they still want to be able to handle 8-bit data too, and
don't like the idea of wasted space in an international software release
when it is used in an 8-bit character country.)
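How a given implementation treats "plain" chars can be probed with a few
lines of C.  This is a sketch, assuming 8-bit chars; the function names
are hypothetical, and CHAR_MIN comes from <limits.h>.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical: nonzero when plain char behaves like signed char. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Hypothetical: the value a high-bit character takes once it is
 * widened to int.  Negative where plain char is signed, positive
 * where it is unsigned. */
int high_bit_char_as_int(void)
{
    char c = '\200';
    return c;
}

/* Hypothetical: check that every character of s appears positive,
 * as the characters of the C source set are required to. */
int all_positive(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (*s <= 0)
            return 0;
    }
    return 1;
}
```

On a signed-char machine high_bit_char_as_int() is negative, so a
strcmp() built on plain char subtraction orders '\200' below 'a'; on an
unsigned-char machine it is positive and the order flips.  A string such
as "abcXYZ019" passes all_positive() either way.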
msb@sq.uucp (Mark Brader) (02/20/88)
The wording in the November 1987 (the second-latest) draft is:

#	4.11.4 Comparison functions
#	The sign of the value returned by the comparison functions is
#	determined by the sign of the difference between the values of the
#	first pair of characters that differ in the objects
		[ i.e. the strings, in this case ]
#	being compared.  If one of the characters has its high-order bit set,
#	the sign of the result is implementation-defined.
	...
#	4.11.4.2 The strcmp function
	...
#	The strcmp function returns an integer greater than, equal to, or
#	less than zero, according as the string pointed to by [argument] s1
#	is greater than, equal to, or less than the string pointed to by s2.

Thus strcmp ("xy\300", "xy\100") may return a positive or negative number
but not zero.

The last time I read this section, I decided that the words about "first
pair of characters that differ" meant that the same was true in the case
of strcmp ("x\300", "x"); but now I'm not so sure.  That was probably the
*intent*, but one could consider it to be contradicted by the word
"strings" in the last quoted sentence, taken together with the usual
notion of lexical ordering.

Mark Brader			"C takes the point of view
SoftQuad Inc., Toronto		 that the programmer is always right"
utzoo!sq!msb, msb@sq.com				-- Michael DeCorte
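Under the draft wording quoted here, the only thing a program can rely
on for high-bit data is zero versus non-zero.  A small sketch of code
that stays within that guarantee; sign_of and same_string are
hypothetical helper names, and strcmp() is the standard library routine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: reduce a comparison result to -1, 0 or +1. */
int sign_of(int r)
{
    return (r > 0) - (r < 0);
}

/* Hypothetical: equality test that uses only the zero / non-zero
 * distinction, so it does not depend on how the implementation
 * orders characters with the high-order bit set. */
int same_string(const char *s1, const char *s2)
{
    assert(s1 != NULL && s2 != NULL);
    return strcmp(s1, s2) == 0;
}
```

Here sign_of(strcmp("xy\300", "xy\100")) is either -1 or +1 depending on
the implementation, never 0, and same_string("xy\300", "xy\100") is
false everywhere.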
henry@utzoo.uucp (Henry Spencer) (02/24/88)
> ... The reason
> for allowing the indeterminacy in strcmp()'s return sign when the
> differing characters have the high bit set is simply because that is
> the way C "plain" chars are, so that is in fact how existing implementations
> behave.

Unfortunately, the wording also appears to allow indeterminate results
from strcmp("aX", "a") where X is a high-bit character... which is WRONG!
The lexical ordering of those two strings is well-defined regardless of
char signedness or collating sequence; allowing implementation-defined
results here makes strcmp almost useless in high-bit environments.
--
Those who do not understand Unix are |  Henry Spencer @ U of Toronto Zoology
condemned to reinvent it, poorly.    | {allegra,ihnp4,decvax,utai}!utzoo!henry