[comp.lang.c] Bug in ANSI C??

dick@slvblc.UUCP (Dick Flanagan) (02/14/88)

In article <2118@bsu-cs.UUCP> dhesi@bsu-cs.UUCP (Rahul Dhesi) writes:
>In article <241@oracle.UUCP> rbradbur@oracle.UUCP (Robert Bradbury) writes:
>>On another note; does everyone realize that the current standard allows
>>the results of the str/memcmp() function to be implementation defined
>>if the characters being compared have the high-bit set?
>
>You mean that I can have two identical strings with high bits set, and
>strcmp() could return something other than 0?

No.  Equal is equal.

>Or does the problem lie only in deciding lexical order?

The problem lies in that the developer of the run-time routines is free
to decide that strcmp() is comparing *signed* eight-bit numbers, so that
any character with the high-bit set is considered to be *lower* than any
character with the high-bit off.  This would mean that the non-equal
returns could differ between one compiler and another.
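
To make the difference concrete, here is a minimal sketch of the two
choices open to a library writer.  The names cmp_signed and cmp_unsigned
are hypothetical, not taken from any real library, and the sketch assumes
8-bit chars on a two's-complement machine.  The two routines differ only
in how each byte is interpreted before the subtraction.

```c
/* Two hypothetical strcmp-style routines; the names are illustrative only. */

/* Treats each byte as a signed value, so '\300' (0xC0) counts as -64
   on an 8-bit two's-complement machine. */
int cmp_signed(const char *s1, const char *s2)
{
    while (*s1 != '\0' && *s1 == *s2) {
        s1++;
        s2++;
    }
    return (int)(signed char)*s1 - (int)(signed char)*s2;
}

/* Treats each byte as an unsigned value, so '\300' counts as 192. */
int cmp_unsigned(const char *s1, const char *s2)
{
    while (*s1 != '\0' && *s1 == *s2) {
        s1++;
        s2++;
    }
    return (int)(unsigned char)*s1 - (int)(unsigned char)*s2;
}
```

Both routines return 0 for identical strings, whatever bits are set.
For "xy\300" against "xy\100" the first gives a negative result and the
second a positive one.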

>While we're on the subject, just what is the meaning of "implementation-
>defined"?

"Left to the discretion of the developer of the compiler and its
run-time routines."

Dick

--
Dick Flanagan, W6OLD                         GEnie: FLANAGAN
UUCP: ...!ucbvax!ucscc!slvblc!dick           Voice: +1 408 336 3481
INTERNET: slvblc!dick@ucscc.UCSC.EDU         LORAN: N037 05.5 W122 05.2
USPO: PO Box 155, Ben Lomond, CA 95005

henry@utzoo.uucp (Henry Spencer) (02/16/88)

> The problem lies in that the developer of the run-time routines is free
> to decide that strcmp() is comparing *signed* eight-bit numbers...

The situation actually gets worse.  Consider strcmp("a\200", "a").  Is its
value positive or negative?  The orthodox rule of lexical ordering says
it should be positive, because strlen("a\200") > strlen("a") and
strncmp("a\200", "a", strlen("a")) == 0.  That is, the '\0' that terminates
the string should not participate in comparisons, and it is irrelevant
whether '\200' < '\0' on a signed-char machine.  Existing implementations
often get this wrong.  The X3J11 draft appears to permit this.  (The wording
is not quite specific enough for me to be certain.)
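
A rough sketch of the rule, assuming 8-bit two's-complement chars.  Both
function names are hypothetical.  lex_cmp settles the prefix case by
length alone, so the terminator never enters the arithmetic;
naive_signed_cmp is the kind of routine that gets it wrong, because it
subtracts the terminator from a signed character value.

```c
/* Hypothetical helper: an ordering in which the terminating '\0'
   never takes part in a comparison. */
int lex_cmp(const char *s1, const char *s2)
{
    while (*s1 != '\0' && *s2 != '\0' && *s1 == *s2) {
        s1++;
        s2++;
    }
    if (*s1 == '\0' && *s2 == '\0')
        return 0;       /* same length, same contents */
    if (*s1 == '\0')
        return -1;      /* s1 is a proper prefix of s2 */
    if (*s2 == '\0')
        return 1;       /* s2 is a proper prefix of s1 */
    /* Two real characters differ; their relative order is still
       whatever the implementation's plain char makes it. */
    return (*s1 < *s2) ? -1 : 1;
}

/* For contrast: lets the terminator take part as a signed value. */
int naive_signed_cmp(const char *s1, const char *s2)
{
    while (*s1 != '\0' && *s1 == *s2) {
        s1++;
        s2++;
    }
    return (int)(signed char)*s1 - (int)(signed char)*s2;
}
```

With these, lex_cmp("a\200", "a") is positive on any machine, while
naive_signed_cmp("a\200", "a") comes out negative under the stated
assumptions.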
-- 
Those who do not understand Unix are |  Henry Spencer @ U of Toronto Zoology
condemned to reinvent it, poorly.    | {allegra,ihnp4,decvax,utai}!utzoo!henry

wnp@dcs.UUCP (Wolf N. Paul) (02/17/88)

In article <2118@bsu-cs.UUCP> dhesi@bsu-cs.UUCP (Rahul Dhesi) writes:
 >In article <241@oracle.UUCP> rbradbur@oracle.UUCP (Robert Bradbury) writes:
 >>On another note; does everyone realize that the current standard allows
 >>the results of the str/memcmp() function to be implementation defined
 >>if the characters being compared have the high-bit set?
 >
 >You mean that I can have two identical strings with high bits set, and
 >strcmp() could return something other than 0?  This would be truly a
 >serious flaw in the standard.

The purpose of this would be to allow the "alternate" character set
(= codes > 127) to be used for international language applications.
Languages which have more than 26 alpha characters need the upper half
of the eight-bit code range to implement their languages, and in that
case ignoring the 8th bit would be very counter-productive.

 >Or does the problem lie only in deciding lexical order?  That's not so
 >bad--I wouldn't trust the standard library in that case anyway, since
 >we all have our own opinions about what lexical order is correct.

Similarly, we all have our own opinions about how many unique characters
should be available -- limiting the number to 128 is more restrictive
than necessary, so I think the standard is appropriate.

 >While we're on the subject, just what is the meaning of "implementation-
 >defined"?  E.g., "Oh, by the way, I would avoid comparing strings
 >with high bits set, since the result is implementation-defined.  On
 >this particular system the result is that all disk packs are erased and
 >the system halts."

Hardly. Compiler implementors do want to sell more than a few copies
of their product, and such behaviour would not be conducive to that goal :-)

 >-- 
 >Rahul Dhesi         UUCP:  <backbones>!{iuvax,pur-ee,uunet}!bsu-cs!dhesi
----
Wolf N. Paul                  Phone: (214) 306-9101 (h)   (214) 404-8077 (w)
3387 Sam Rayburn Run          UUCP: ihnp4!killer!{dcs, doulos}!wnp
Carrollton, TX 75007          INTERNET: wnp@dcs.UUCP       ESL:  62832882
Pat Robertson does NOT speak for all evangelical Christians--not for me, anyway!

pardo@june.cs.washington.edu (David Keppel) (02/18/88)

[ Define implementation-defined ]

> >with high bits set, since the result is implementation-defined.  On
> >this particular system [that] result is that all disk packs are erased and
> >the system halts."
>
>Hardly. Compiler implementors do want to sell more than a few copies
>of their product, and such behaviour would not be conducive to that goal :-)

:-( See the recent discussion about (a) no 8086 protection and disk-drive
    tables borked by erring large-model programs (a historical accident)
    and (b) the Microsoft "optimize by breaking" 5.0 compiler.

    ;-D on  (That's why I'll use a pencil instead of OS/2)  Pardo

cjc@ulysses.homer.nj.att.com (Chris Calabrese[rs]) (02/18/88)

In article <16@dcs.UUCP>, wnp@dcs.UUCP writes:
> In article <2118@bsu-cs.UUCP> dhesi@bsu-cs.UUCP (Rahul Dhesi) writes:
>  >In article <241@oracle.UUCP> rbradbur@oracle.UUCP (Robert Bradbury) writes:
>  >>On another note; does everyone realize that the current standard allows
>  >>the results of the str/memcmp() function to be implementation defined
>  >>if the characters being compared have the high-bit set?
> 
> The purpose of this would be to allow the "alternate" character set
> (= codes > 127) to be used for international language applications.
> Languages which have more than 26 alpha characters need the upper half
> of the eight-bit code range to implement their languages, and in that
> case ignoring the 8th bit would be very counter-productive.

If ansi wants this to really work, they'll have to allow for
16 bit char's, the standard in Japanese and Chinese language
word processors.  There is still a problem with
using the 8th bit, as many machines generate strict parity
for character work.  Presumably, the lexical ordering problem
can be eliminated by stripping the 8th bit before comparison,
or better yet, 15 bit char's with 1 bit parity, or any
other combo.
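
A minimal sketch of the bit-stripping idea, for illustration only.  The
name cmp_strip8 is hypothetical and the code assumes 8-bit
two's-complement chars.  The end-of-string test has to look at the raw
character, because '\200' masks down to zero and would otherwise be
mistaken for the terminator.

```c
/* Hypothetical routine: orders strings by the low seven bits of each
   character, ignoring the 8th bit. */
int cmp_strip8(const char *s1, const char *s2)
{
    for (;; s1++, s2++) {
        int c1 = *s1 & 0x7F;
        int c2 = *s2 & 0x7F;

        if (c1 != c2)
            return c1 - c2;
        /* Masked values agree; stop when either string really ends.
           The string that ends first is the smaller one. */
        if (*s1 == '\0' || *s2 == '\0')
            return (*s1 != '\0') - (*s2 != '\0');
    }
}
```

The sign no longer depends on char signedness, but there is a cost:
cmp_strip8("xy\300", "xy\100") returns 0, so two different strings
compare equal.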

	Chris Calabrese
	AT&T Bell Labs
	ulysses!cjc

gwyn@brl-smoke.ARPA (Doug Gwyn ) (02/19/88)

In article <10095@ulysses.homer.nj.att.com> cjc@ulysses.homer.nj.att.com (Chris Calabrese[rs]) writes:
>In article <16@dcs.UUCP>, wnp@dcs.UUCP writes:
>> In article <2118@bsu-cs.UUCP> dhesi@bsu-cs.UUCP (Rahul Dhesi) writes:
>>  >In article <241@oracle.UUCP> rbradbur@oracle.UUCP (Robert Bradbury) writes:
>>  >>On another note; does everyone realize that the current standard allows
>>  >>the results of the str/memcmp() function to be implementation defined
>>  >>if the characters being compared have the high-bit set?
>> The purpose of this would be to allow the "alternate" character set
>> (= codes > 127) to be used for international language applications.
>>...
>If ansi wants this to really work, they'll have to allow for
>16 bit char's, the standard in Japanese and Chinese language
>word processors.  There is still a problem with
>using the 8th bit, as many machines generate strict parity
>for character work.

This discussion has gone onto completely the wrong track.  The reason
for allowing the indeterminacy in strcmp()'s return sign when the
differing characters have the high bit set is simply because that is
the way C "plain" chars are, so that is in fact how existing implementations
behave.

The C source characters are required to appear positive, although other
additional characters in an implementation can appear negative.  This
means that an 8-bit EBCDIC implementation would have to make "plain"
chars act like unsigned chars, for example.
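
One way to see which choice a given implementation has made is to look
at CHAR_MIN in <limits.h>, which is 0 when "plain" chars act like
unsigned chars and negative otherwise.  A small sketch follows; both
helper names are hypothetical, and the sample string is just a handful
of source characters, not the whole set.

```c
#include <limits.h>

/* Hypothetical helper: nonzero if plain char can hold negative values. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Hypothetical helper: checks that a few C source characters appear
   positive when stored in a plain char. */
int source_chars_positive(void)
{
    const char *sample = "aZ09_{}#";

    for (; *sample != '\0'; sample++)
        if (*sample <= 0)
            return 0;
    return 1;
}
```

On an 8-bit EBCDIC system, where letters and digits have the high bit
set, the second check can only pass if the first helper returns zero.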

The proposed ANSI C provides adequate (but minimal) support for "multi-byte
characters" such as are used in Japan.  Note that this is not the same as
16-bit chars, which are permitted but would not usually be the implementor's
choice for those environments.  (Even though 16-bit chars are conceptually
and practically much cleaner than explicit multi-byte sequences, implementors
still want to be able to handle 8-bit data too, and don't like the idea of
wasted space in an international software release when it is used in an
8-bit character country.)

msb@sq.uucp (Mark Brader) (02/20/88)

The wording in the November 1987 (the second-latest) draft is:

# 4.11.4 Comparison functions
#	The sign of the value returned by the comparison functions is
# determined by the sign of the difference between the values of the
# first pair of characters that differ in the objects
  [ i.e. the strings, in this case ]
# being compared.  If one of the characters has its high-order bit set,
# the sign of the result is implementation-defined.
  ...
# 4.11.4.2 The strcmp function
  ...
# The strcmp function returns an integer greater than, equal to, or
# less than zero, according as the string pointed to by [argument] s1
# is greater than, equal to, or less than the string pointed to by s2.

Thus strcmp ("xy\300", "xy\100") may return a positive or negative
number but not zero.

The last time I read this section, I decided that the words about
"first pair of characters that differ" meant that the same was true
in the case of strcmp ("x\300", "x"); but now I'm not so sure.
That was probably the *intent*, but one could consider it to be
contradicted by the word "strings" in the last quoted sentence, taken
together with the usual notion of lexical ordering.

Mark Brader				"C takes the point of view
SoftQuad Inc., Toronto			 that the programmer is always right"
utzoo!sq!msb, msb@sq.com				-- Michael DeCorte

henry@utzoo.uucp (Henry Spencer) (02/24/88)

> ...  The reason
> for allowing the indeterminacy in strcmp()'s return sign when the
> differing characters have the high bit set is simply because that is
> the way C "plain" chars are, so that is in fact how existing implementations
> behave.

Unfortunately, the wording also appears to allow indeterminate results from
strcmp("aX", "a") where X is a high-bit character... which is WRONG!  The
lexical ordering of those two strings is well-defined regardless of char
signedness or collating sequence; allowing implementation-defined results
here makes strcmp almost useless in high-bit environments.
-- 
Those who do not understand Unix are |  Henry Spencer @ U of Toronto Zoology
condemned to reinvent it, poorly.    | {allegra,ihnp4,decvax,utai}!utzoo!henry