[net.lang.c] structure and array and string comparisons

gnu@sun.uucp (John Gilmore) (03/19/84)

If we do ANYTHING about this, we should do it right.  This means that
we can't embed "zero byte means end" into the language -- a character "string"
is simply an array of chars, and don't you forget it.  In any case,
we can't do array comparison, since an array is by definition converted
to a pointer to its first element every time it's used -- there's no way to
refer to the whole array except in sizeof.  When I want to pass an array
as an argument, I typically enclose it in a struct.  The flexibility of 
pointer==array==pointer is a big win but it has some minor losses like this.

Personally I think that avoiding structure comparison "because it's hard"
is a copout.  It's already "hard" for the C programmer who has to compare
each member of the structure individually.  For the compiler, 99% of the
cases are without holes and it's easy and efficient.

I don't propose comparing unions, or following pointers.  The comparison
s1 == s2 should be equivalent, except for storage reference pattern, to:

	( (s1.memb1 == s2.memb1) && (s1.memb2 == s2.memb2) && ...)

tim@unc.UUCP (Tim Maroney) (03/20/84)

>  If we do ANYTHING about this, we should do it right.  This means
>  that we can't embed "zero byte means end" into the language -- a
>  character "string" is simply an array of chars, and don't you
>  forget it.  In any case, we can't do array comparison, since an
>  array is by definition converted to a pointer to its first
>  element every time it's used -- there's no way to refer to the
>  whole array except in sizeof.  When I want to pass an array as an
>  argument, I typically enclose it in a struct.  The flexibility of
>  pointer==array==pointer is a big win but it has some minor losses
>  like this.

From the C Reference Manual, Sec. 2.5:  "The compiler places a null byte \0
at the end of each string so that programs which scan the string can find
its end."  Thus "zero byte means end" is already written into the language.

It is true that right now, array-typed identifiers are automatically
converted to pointers in expressions.  However, the type of the identifier
is still "array".  (C Ref. Man, 14.3: "Every time an identifier of array
type appears in an expression, it is converted into a pointer to the first
member of the array.")  However, I can see no strong objection to violating
this for expressions of new operators that we add to the language, which was
my first suggestion.

It would be just as easy, but not backward compatible, to change the rules
and use the existing operators, saying that arrays are not converted to
pointers when they are the operands of the comparison operators; instead,
the compiler will produce code to compare the arrays elementwise.  This
avoids adding new elementwise comparison operators, but it means that old
code which compares array identifiers to something would have to be changed
to include type casts to turn the array into a pointer.

In either case, there are two possible ways that you could determine where
the end of an array is.  If the identifier is a fixed-size array, then you
know how many elements there are at compile time, and you can just SOB or
AOBLSS or whatever to do a fixed number of comparisons.  If it is an array
that was declared with a null constant expression, it can be assumed that
the array is null-terminated.  Thus to compare strings you would store their
addresses in variables declared, for instance "char StringVar[]".  To
compare complex numbers, you would declare the complex variables like:

float complexVar[2];

or, more cleanly,

typedef float complex[2];
complex complexVar;

and the compiler would generate the array comparison code when you compared
two identifiers of type "complex".  This is a side effect of providing array
comparison that I hadn't realized at first; I should note that this only
works like you want it to for equality comparison, not ordering.

The idea of structure comparison also points to adding array comparison.
It's certain that if you are doing a memberwise comparison of structures
with array-typed fields, you don't want to compare the addresses of the
arrays: they will always be different if the structs are not the same block
of storage (which you should always check for first anyway if it could
happen because it is so much faster when it's true).

>  I don't propose comparing unions, or following pointers.  The
>  comparison s1 == s2 should be equivalent, except for storage
>  reference pattern, to:
>
>     ( (s1.memb1 == s2.memb1) && (s1.memb2 == s2.memb2) && ...)

I agree with you, although it might be nice to be able to specify that a
pointer should be followed -- that comes perilously close to creeping
featurism, though, and I have already listed some problems with it in my
previous article, so I'll just shut up about it....

However, it is interesting to note that if the semantics of array comparison
were changed as I suggested, then you would get the proper comparison of
array-typed structure members.
--
Tim Maroney, The Censored Hacker
mcnc!unc!tim (USENET), tim.unc@csnet-relay (ARPA)

All opinions expressed herein are completely my own, so don't go assuming
that anyone else at UNC feels the same way.

rcd@opus.UUCP (03/21/84)

<>
> Personally I think that avoiding structure comparison "because it's hard"
> is a copout.  It's already "hard" for the C programmer who has to compare
> each member of the structure individually.  For the compiler, 99% of the
> cases are without holes and it's easy and efficient.
>
> I don't propose comparing unions, or following pointers.

I'm sorry I ever slipped and understated the situation!  When I said that
comparing structures is "hard" I meant that it's hard enough to be
impractical.  Unions are perhaps the worst aspect, and you CAN'T avoid
unions if you are going to compare structures, because a structure can
contain a union.  Filling in a little bit of the reasoning:  The elements
of a union can be of differing size.  If you compare two instances of a
struct-containing-union, you have to know how much of the union to compare,
to avoid a spurious mismatch on the unused trailing part of a short field.
To do that, you have to know which field of the union is the current one,
which in turn would mean that code would be required at EVERY assignment to
a field of the union (plus or minus some very messy optimization) to keep
track of the current field.  This isn't even remotely practical with C's
undiscriminated union.

[OK, at least 5% of you are ready to suggest allowing "comparisons of
structures unless they contain unions".  Please don't.  First, this is
pretty clearly a clunky special case we don't need - but it also makes a
mess of a certain sort of data-hiding:
	struct junk {
		int a,b,c,d;
		mystery xyzzy;
	}
where "mystery" is a typedef from elsewhere.  The "junk" type shouldn't
have to care what's in type "mystery"; it just needs one.  (Data
abstraction and all that apple pie...)  However, the semantics of objects
of type "struct junk" - in particular, whether they can be compared - would
depend on whether "mystery" was, or contained, a union.]
-- 
{hao,ucbvax,allegra}!nbires!rcd

rcd@opus.UUCP (03/21/84)

<>
> In either case, there are two possible ways that you could determine where
> the end of an array is.  If the identifier is a fixed-size array, then you
> know how many elements there are at compile time, and you can just SOB or
> AOBLSS or whatever to do a fixed number of comparisons.  If it is an array
> that was declared with a null constant expression, it can be assumed that
> the array is null-terminated...

Two problems with this.  First, what does "null terminated" mean?  Fine for
arrays of char, maybe even arrays of pointers if you stretch it. (AAACCK!  I
hear the ol' null-pointer portability discussion starting up again.  Forget
I said anything about pointers!)  But what about arrays of int - what's a
"null" for that type?  Is this comparison going to be a special kluge for
chars only?  Second problem - you introduce a very funny nonuniformity of
behavior:  if the dimension is known, you get a comparison of a fixed
number of elements; if unknown, a delimiter-controlled comparison.  I can
think of situations where changing a declaration or two could really make
things go haywire.
-- 
{hao,ucbvax,allegra}!nbires!rcd

neal@denelcor.UUCP (Neal Weidenhofer) (03/23/84)

**************************************************************************

>You missed his point; to correctly compare two structures a byte-for-byte
>comparsion should *not* be done.  The bytes which are not part of structure
>members, but which are just padding to put structure members on the right
>boundary, *must* not be compared or two structures which "should" be equal
>won't compare equal.

Yes but it's no harder for the compiler than it is for the programmer
except for unions (see more below).

>Yes, the comparison would be nice, but it's not trivial to implement and
>won't compile into a simple comparison.  Also, < and > are flatly impossible;
>what about
>
>	struct complex {
>		double	realpart;
>		double	imagpart;
>	};
>
>Any result of < and > would be misleading as the complex numbers aren't an
>ordered field.

Granted that '<' and '>' would be misleading in certain cases, I suggest
lexicographic comparison.  It would still be useful in many cases, for
arrays as well as structures.  If a programmer insisted on using it where
it was meaningless (e.g., complex numbers) it would be no worse than many
features "C" and several other languages have (GIGO).

>I'm sorry I ever slipped and understated the situation!  When I said that
>comparing structures is "hard" I meant that it's hard enough to be
>impractical.  Unions are perhaps the worst aspect, and you CAN'T avoid
>unions if you are going to compare structures, because a structure can
>contain a union.  Filling in a little bit of the reasoning:  The elements
>of a union can be of differing size.  If you compare two instances of a
>struct-containing-union, you have to know how much of the union to compare,
>to avoid a spurious mismatch on the unused trailing part of a short field.
>To do that, you have to know which field of the union is the current one,
>which in turn would mean that code would be required at EVERY assignment to
>a field of the union (plus or minus some very messy optimization) to keep
>track of the current field.  This isn't even remotely practical with C's
>undiscriminated union.

Another solution would be to clear the unused parts of the union wherever
a store is done.  In many cases (admittedly not all) this would be very
cheap.

>[OK, at least 5% of you are ready to suggest allowing "comparisons of
>structures unless they contain unions".  Please don't.  First, this is
>pretty clearly a clunky special case we don't need - but it also makes a
>mess of a certain sort of data-hiding:
>	struct junk {
>		int a,b,c,d;
>		mystery xyzzy;
>	}
>where "mystery" is a typedef from elsewhere.  The "junk" type shouldn't
>have to care what's in type "mystery"; it just needs one.  (Data
>abstraction and all that apple pie...)  However, the semantics of objects
>of type "struct junk" - in particular, whether they can be compared - would
>depend on whether "mystery" was, or contained, a union.]

Your point is well taken but I wonder if you realize that you've also
made it IMPOSSIBLE for the programmer to compare two structures of type
"junk".  It's in just such cases that structure comparison would be the
most useful.

I agree that there are some very hard problems but I think we need to give
it more thought and conversation before we dismiss it as impractical.  The
point I see is that structures NEED to be compared sometimes, is it better
for the compiler to make it easy for the programmer or not?  Except for
the problem with unions, the compiler has as much information about the
structure as the programmer does--often more.

			Regards,
				Neal Weidenhofer
				Denelcor, Inc.
				<hao|csu-cs|brl-bmd>!denelcor!neal

jas@druxy.UUCP (ShanklandJA) (03/23/84)

For goodness' sake!  When you need to do structure or array
comparison, how hard is it to write a little macro that does
the job?  Under most circumstances, I can't see it taking more
than 5 minutes.  (Yes, people sometimes use structures with
dozens -- or hundreds of fields.  But how often?  How often do
such structures need to be compared in their entirety?  Is it worth
hacking up the language to accomodate such cases?

The advantages of the macro approach are:
(1)  you get to interpret "equality" any way you want:  dictionary
     order for some fields, numerical equality for others, etc.;
     pointers must point to one and the same object to be equal,
     or pointers to equal-valued objects are considered equal
     (and of course, you get to decide what makes the pointed-to
     objects "equal");
(2)  you stop creeping featurism in its tracks.

One objection I foresee is that people will say that the compiler
could generate more efficient comparison code directly, say by
emitting a single block-compare instruction to compare several
truly adjacent fields ("truly adjacent" means no gap for alignment),
whereas if you compare each field separately in a macro, the compiler
will most likely emit separate compare instructions for each field.
But I suggest the solution to that is to use a compiler that will optimize
such code sequences into a single block-compare wherever possible.
Granted, I'm not at all sure that a C compiler that optimizes to
that extent exists anywhere, on any machine; but just wait....

Keep C small!

Jim Shankland
..!ihnp4!druxy!jas

hom@hocda.UUCP (H.MORRIS) (03/24/84)

Having a compiler that recognizes sequences like
if(x.y.z.elt1 == t.y.z.elt1 && x.y.z.elt2 == t.y.z.elt2 ...)
and generates a single block compare if appropriate
instead of implementing struct compares
*because we want to keep C simple*
would seem to me to defeate the purpose.

gwyn@brl-vgr.ARPA (Doug Gwyn ) (03/25/84)

I think the real problem with struct comparison is that there is no
"natural" order relation among structs of the same kind, except for
exact equality.  This is in contrast to real numbers where there is
a natural order defined.  (For an illustration, consider a "complex
number" struct.  Ask the nearest mathematician how to determine which
of two complex numbers is the "smallest".  If he gives you any
algorithm at all, it will probably involve the moduli of the complex
numbers.  How is the compiler going to know all this?)

I have the same objection to overloading of operators (as in Ada).
Just because one can use the same symbol (e.g. "+") to combine two
objects in each of several different classes by no means implies
that it is the "same" operation in the different classes, or even a
unique extension.  (Example:  a Boolean "+" could be a "union"; it
could also be a "Boolean sum".  Which is "correct"?)

tbray@mprvaxa.UUCP (Tim Bray) (03/25/84)

x <-- eat me

This response is provoked by Tim Maroney's suggestion that built-in
comparison of of arrays could be implemented in the case where the
length is not known by assuming a null element terminator.

                ************WRONG*************

The whole null-termination concept is a misfeature (yes, the costs
outweigh the benefits) and its use should be avoided in built-in
whatevers.

Other submittors have pointed out the problems which make structure/array
built-ins impractical to implement.  There is, however, one way to 
sidestep these problems:  Use a descriptor architecture.
For strings, the descriptor encodes the address and length (that's
right, no more explaining to people that you need 13 bytes to store
a 12-character string - and you can now embed null bytes in strings!)
You could also pass arrays around by descriptor and the receiving
code would not need to know in advance the size, but could determine
it unambiguously.

However, existing UNIX(tm) software is so heavily married to the
concept of null termination that it ain't gonna happen.  And look,
if wrestling with the UNIX string subroutines is the worst programming
problem you have, stop complaining... you could be composing DCB's for
IEBGENER.

Tim Bray

donn@sdchema.UUCP (03/26/84)

This debate is over the wrong issues.

What I hear some people saying is, 'Wow!  My machine has a whiz-bang
block compare instruction!  If only my compiler could generate it, my
code would be SO much faster!'  People on the one side of the fence
seem to think that the only way to get this into the compiler is to
make a built-in language construct like structure comparison, while
people on the other side are willing to live with a function call if it
means keeping the syntax of C safe from questionable additions.

How about meeting halfway?  The Berkeley folks often use a
post-processing 'sed' script which takes nasty things like splx() and
ntohl() and produces in-line code.  While I'm sure we all groan at the
grossness of this kind of intervention, why not take the grossness away
and let the compiler have access to a 'library' of in-line assembly
language functions?

A macro-like syntax for these functions could be developed so that
there would be a standard way of adding to the library when desired.
No C compiler would be required to implement the functions in-line;
thus no C syntax or semantics would change.  Any compiler that didn't
implement in-line functions would simply cause programs to use the
standard C library.

Thus we might see pseudo-C definitions like:

/*
 * strlen( s ) -- Return length of s.  VAX inline code.
 */
inline int strlen( s ) {
	register r

	movl	s,r
	locc	$0,$0x7fffffff,(r)	/* Search for null char in s */
	subl2	r,r0			/* Subtract s from addr of end of s */
}

(The intent of the 'register' declaration would be to ask the compiler
to make a register available, by using one that is free or by saving
one on the stack and restoring it later.)

This would make string manipulating 'functions' just as fast as the
proposed string operators, but the idea is more general and C's syntax
doesn't suffer.  Structure comparisons would be done quickly by a
simple bcmp(), which might cause an in-line 'cmpc3' instruction to be
emitted on a VAX or a subroutine call on a PDP11.  (Or a tight loop --
the arrangement is flexible enough that anything you could code
yourself in assembler, you could make available to the compiler to
improve your C code.)  No smarts are needed in the compiler to spot
loops that could be compiled into exotic instructions.

I was hoping to hear that this sort of thing was possible with PCC 2
when I went to Steve (?) Johnson's talk at Uniforum, since there was
something in the abstract that looked like it might be it.  Nothing
came up in the talk and things were too busy afterwards for me to get
to buttonhole him in the hall.  Is anyone out there familiar enough
with PCC 2 to say whether in-line assembler functions are present?

Too much wishful thinking,

Donn Seeley    UCSD Chemistry Dept.       ucbvax!sdcsvax!sdchema!donn

hom@hocda.UUCP (H.MORRIS) (03/27/84)

One should not confuse the issues of implementing >, <, etc for
structs with that of implementing simple equality (and !=).
I categorically think that > and < should NOT be implemented for
structures.  But I think since we have assignment, it makes good
sense to define == and != for structures.

gwyn@brl-vgr.ARPA (Doug Gwyn ) (03/27/84)

String descriptors and null-terminated strings are two competing
approaches to string design.  Each has strong and weak points.
However, as a user of both I would hate to see the user forced to
use descriptor-based strings in C.  Terminated-arrays are much
easier to deal with in C code in the majority of cases.

lwall@sdcrdcf.UUCP (Larry Wall) (03/27/84)

In article <2831@brl-vgr.ARPA> gwyn@brl-vgr.ARPA (Doug Gwyn ) writes:
>...
>I have the same objection to overloading of operators (as in Ada).
>Just because one can use the same symbol (e.g. "+") to combine two
>objects in each of several different classes by no means implies
>that it is the "same" operation in the different classes, or even a
>unique extension.  (Example:  a Boolean "+" could be a "union"; it
>could also be a "Boolean sum".  Which is "correct"?)

Just what kind of an idealistic totalitarian are you?  By this kind of
argument we can rid the world of freedom forever.  Just because a freedom
can be abused doesn't mean we should do away with it.

GIVE ME OVERLOADING, OR GIVE ME DEATH!

Seriously, there are some very good arguments in favor of overloading.

1)  Almost all languages do it surreptitiously already (e.g. what is the
    type of "+" in C?).

2)  The human mind thrives in the presence of overloading and other
    context-dependent activity.  Consult any dictionary for examples.

3)  It helps keep the namespace uncluttered.  No need to make up silly names
    just because you want to do the same (or similar) thing to a different
    type of object.  "Martha, I want to can-open this can."

4)  It facilitates "generic" abstractions.

What a Boolean "+" might mean depends on where I get the definition from,
now doesn't it?  Anyone who uses an operation with semantics they don't know
is asking for it anyway, regardless of whether the semantics come from
a "package" or are built into the language.  If this is indeed the case,
why not release language designers from having to predict ahead of time
all the (reasonable) uses that someone might want to put an operator to,
and give some modicum of freedom to package designers instead.  I assure
you, if the package designers misuse the freedom, people will tend not to
use those packages.

Thank goodness Ada isn't "perfect".  Just a little more perfectible.

Larry Wall
{allegra,burdvax,cbosgd,hplabs,ihnp4,sdcsvax}!sdcrdcf!lwall
(I'm glad my organization doesn't make me put a stupid disclaimer down
here, which no one ever reads anyway, for reasons obvious to all but a few
bosses...  After all, when's the last time you saw an ORGANIZATION have an
opinion on the net?)

scw@cepu.UUCP (03/27/84)

Sounds like a *GASP* PL/I dope vector to me.
-- 
Stephen C. Woods (VA Wadsworth Med Ctr./UCLA Dept. of Neurology)
uucp:	{ {ihnp4, uiucdcs}!bradley, hao, trwrb, sdcsvax!bmcg}!cepu!scw
ARPA: cepu!scw@ucla-locus
location: N 34 06'37" W 118 25'43"

dwr@ccieng6.UUCP ( Donald Wallace Rouse II) (03/29/84)

The version of C that I use allows compile-time initialization of the
non-union elements of non-automatic structures, as:

	static struct s
		{
		int	e1;
		char	e2;
		union
			{
			int	e31;
			double	e32;
			} e3;
		} x = { 1, 'c' };

If C special-cases initialization in this way, there is no reason it
can't special-case comparisons of structures containing unions.  I
agree that comparisons other than == != are probably not too useful.
Don't worry about holes; let the compiler do it.  If the compiler can
keep track of holes during initialization, there is no reason that it
can't keep track of them during comparison and generate appropriate
code to skip them.

On a slightly related topic, my favorite extension to C would be
structure and array constants on the right-hand side of assignment
expressions, as:

	int x ()
		{
		struct s	x;
		int		i [5];

		if (expr)
			x = { 4, 'x' };
		else
			i = { 1, 2, 3, 4, 5 };
		}
									D2

rcd@opus.UUCP (03/30/84)

<>
Boy, this would-be feature doesn't want to die...
>>Any result of < and > would be misleading as the complex numbers aren't an
>>ordered field.
>
>Granted that '<' and '>' would be misleading in certain cases, I suggest
>lexicographic comparison.  It would still be useful in many cases, for
>arrays as well as structures...
Arrays might or might not be regarded as having an ordering property.
Structures, both philosophically and in general usage, are almost always
regarded as cartesian products of their components, hence inherently
unordered.  What's the point of defining ordering when the abstraction
modeled doesn't have the property?  Is it really so useful?

>>(incredibly verbose explanation (mine) of why comparison is very costly
>>if struct contains a union - see referenced article)
>Another solution would be to clear the unused parts of the union wherever
>a store is done.  In many cases (admittedly not all) this would be very
>cheap.
This is just not so.  It implies that any store into a field (or any
subfield) of a union requires clearing the unused tail for that field,
which can give you a 2:1 increase in the cost of a simple store for some
very common cases.  Moreover, I claim there's no implementation which is
reasonable and completely correct, because you can generate pointers to the
fields of the union (and this includes actual union-field parameters sent
with &) and store into the union indirectly.  (I'm beginning to wonder
whether this is a contest to see if I tire of constructing counterexamples
to proposals!)
>>(example in which a struct type "junk" may or may not be comparable
>>depending on whether it contains a union)
>Your point is well taken but I wonder if you realize that you've also
>made it IMPOSSIBLE for the programmer to compare two structures of type
>"junk"...
Yes, I realize that completely.  I even like it.  I like language rules
that say things are either possible or impossible under all circumstances
much better than I like rules that say that some things are possible
sometimes.  Special cases make the language harder to learn and use, and
make compilers larger, slower, harder to write and harder to make correct.

>I agree that there are some very hard problems but I think we need to give
>it more thought and conversation before we dismiss it as impractical.  The
>point I see is that structures NEED to be compared sometimes, is it better
>for the compiler to make it easy for the programmer or not?...
And the point that I see is that, given the desire to achieve a not-very-
large language, what should go in and what should stay out?  Even more
important, since the discussion started around changes to C, another
question is whether a change such as structure comparison is useful enough
(ignoring implementability for the moment) to justify changing the
language.  Remember that even for an upward-compatible change you must
change existing compilers.
-- 
{hao,ucbvax,allegra}!nbires!rcd

rcd@opus.UUCP (04/01/84)

<>
> Except for
> the problem with unions, the compiler has as much information about the
> structure as the programmer does--often more.

Actually, this isn't true either.  If you think about situations in which
you want to compare structures, it is probably true a fair amount of the
time that the comparisons should be done in a particular order.  That is,
you will want the effect of
	if ((a.f1==b.f1) && (a.f2==b.f2) && ...)
where the compiler would not necessarily choose to compare f1 before f2.
The point is that the programmer may well have more information - he may be
able to decide which fields mismatch more often and arrange to compare them
first.

[An aside:  I would conjecture that the efficiency gains from having the
compiler use special comparison instructions would probably be less overall
than the efficiency gains from having the programmer specify the
comparisons in optimal order.]
-- 
{hao,ucbvax,allegra}!nbires!rcd