[comp.lang.c] Is an object made up of bytes?

msb@sq.UUCP (01/16/87)

Richard Stallman says [in effect]:
>  I am not sure whether the standard implies that, given "short in, out;",
> 
> {   char *inptr, *outptr; int i;
>     inptr = (char *) &in; outptr = (char *) &out;
>     for (i = 0; i < sizeof (short); i++) outptr[i] = inptr[i];   }
> 
>  is defined and equivalent to "out = in;".

and Doug Gwyn replies:
$ No, this can't be guaranteed. For example, there may be bits
$ in the short that are not covered by its chars.

I'm pretty sure this is wrong.  The draft proposed standard says:

* Section 1.5, page 2, lines 33-34 and 38-40:

# Byte - the unit of data storage in the execution environment large
# enough to hold a single character in the character set of the execution
# environment.
  ...

# Except for bit-fields, objects are composed of contiguous sequences
# of one or more bytes, the number, order, and encoding of which are
# implementation-defined (except where explicitly specified).

That rules out the interpretation Doug gives, and the following seems
to me to lock in the fact that the above code segments ARE equivalent
(barring interrupts):

* Section 3.1.2.5, page 20, lines 47-48:

# An object declared as a character (char) is large enough to store
# any member of the execution character set. ...

* Section 3.3.3.4, page 38, lines 13-16:

# The sizeof operator yields the size (in bytes) of its operand, which
# may be an expression or the parenthesized name of a type.
#
# When applied to an operand that has type char, unsigned char, or
# signed char, the result is 1. ...

Of course, this means that the draft proposed standard disallows any
implementation using, say, 7-bit chars and 36-bit ints, which might be
desirable on DECsystem-10's; but I think it is reasonable to do so, just
as it is reasonable to disallow non-binary machines.  Too much C assumes
the underlying model stated in section 1.5 to proceed otherwise.

Mark Brader	  "I'm not a lawyer, but I'm pedantic and that's just as good."
utzoo!sq!msb						       -- D Gary Grady
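
The loop under discussion can be tried directly.  What follows is a
minimal sketch, not part of the original thread: the helper names
copy_bytes and short_copy_matches are hypothetical, and the copy goes
through unsigned char rather than plain char, an assumption made here to
sidestep any question about the signedness of char.  It copies a short
byte by byte, as in the quoted loop, and compares the result with the
original value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the thread): copy n bytes one at a
   time through unsigned char pointers, as in the quoted loop. */
static void copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i;

    assert(d != NULL && s != NULL);
    for (i = 0; i < n; i++)
        d[i] = s[i];
}

/* Returns 1 if a byte-wise copy of `in` compares equal to `in`,
   0 otherwise. */
static int short_copy_matches(short in)
{
    short out = 0;

    copy_bytes(&out, &in, sizeof in);
    return out == in;
}
```

On an implementation that follows the section 1.5 model quoted above,
short_copy_matches(12345) would be expected to return 1.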

gwyn@brl-smoke.ARPA (Doug Gwyn ) (01/17/87)

In article <1987Jan15.215225.9688@sq.uucp> msb@sq.UUCP (Mark Brader) writes:
>Richard Stallman says [in effect]:
>>  I am not sure whether the standard implies that, given "short in, out;",
>> {   char *inptr, *outptr; int i;
>>     inptr = (char *) &in; outptr = (char *) &out;
>>     for (i = 0; i < sizeof (short); i++) outptr[i] = inptr[i];   }
>>  is defined and equivalent to "out = in;".
>and Doug Gwyn replies:
>$ No, this can't be guaranteed. For example, there may be bits
>$ in the short that are not covered by its chars.
>I'm pretty sure this is wrong.  The draft proposed standard says:

Mark is, I think, correct in his assessment of the nature of bytes
in the X3J11 model of C objects.  However, I had something else in
mind; interruptions while I was preparing my response kept me from
wording it correctly.  (The extra bits I had in mind were tag
bits; see below for a corrected version.)  I'll try again.

The things that prevent RMS's approach from working portably are:

The semantics of "(char *) &object" aren't guaranteed to produce anything
that can be safely dereferenced to access a char.  The only guarantee is
that the opposite conversion can be made subsequently without losing
information.  This can be an issue for machines that don't support byte
addressing; to keep pointer arithmetic simple, the high-order bits of a
pointer may indicate the size of its dereferenced type; in such a case, if
the cast is merely a word transfer without the bits being shifted and
otherwise rearranged, the cast (char *) does not produce a useful address.

Even if the resulting char pointer designates a char, it might not be the
char that one would guess.  On "little endian" machines it probably would
be, but there may be "big endian" byte-addressed architectures where the
numeric address of a word is not the lowest-valued address of the bytes
within the word; in this case the loop in the example would copy the wrong
collection of bytes (assuming again that the cast is implemented as a
simple word transfer without being rearranged specifically to make such
examples work, which would involve additional overhead).

In a tagged architecture, the pointed-at object may not be referenced as
the wrong type without causing a machine trap.

In general, I believe X3J11 intended to strongly discourage ANY reliance on
"type punning".

P.S.  Upon re-reading 3.3.4 Semantics, I see that RMS and I interpreted the
use of the word "may" differently.  Comparison with other sections of the
document now leads me to believe that RMS was probably correct in thinking
that pointer<->integer conversion via casts MUST be supported by a
conforming implementation, although enough is left "implementation-defined"
that an implementation could choose to make this a useless operation.  This
means that some restriction on use of externs in initializers really is
necessary (to prevent having to support complete C-arithmetic in linkers)
if the typical implementation is to give useful meaning to such conversions.
This deficiency in the draft standard needs to be fixed.
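
The byte-order point above can be probed, though not settled, by looking
at the bytes of a short through a character pointer.  This is a minimal
sketch under stated assumptions: the name low_byte_index is hypothetical,
the pointer is unsigned char rather than plain char, and the cast is
assumed to yield a usable address, which is exactly what the post above
questions for some architectures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical probe (not from the thread): store the value 1 in a
   short and return the index of the first nonzero byte, counting
   from the address that the cast to (unsigned char *) produces. */
static size_t low_byte_index(void)
{
    short one = 1;
    const unsigned char *p = (const unsigned char *)&one;
    size_t i;

    for (i = 0; i < sizeof one; i++)
        if (p[i] != 0)
            return i;
    return sizeof one;  /* not expected for the value 1 */
}
```

A result of 0 suggests that the low-order byte is stored first; a result
of sizeof (short) - 1 suggests that it is stored last.  The probe only
reports what one implementation does and says nothing about what the
draft guarantees.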

gwyn@brl-smoke.ARPA (Doug Gwyn ) (01/18/87)

> The semantics of "(char *) &object" aren't guaranteed to produce anything
> that can be safely dereferenced to access a char.
> ... there may be "big endian" byte-addressed architectures where the
> numeric address of a word is not the lowest-valued address of the bytes
> within the word ...

It occurs to me that X3J11 needs to add a guarantee that at least a cast
to (void *) results in something representing the lowest-valued address
of any byte in the object pointed at by whatever pointer is being
converted; otherwise what good are the mem*() functions?

Some of the issues raised by RMS are deeper than I at first realized.

brett@wjvax.UUCP (Brett Galloway) (01/21/87)

In article <5536@brl-smoke.ARPA> gwyn@brl-smoke.ARPA (Doug Gwyn ) writes:
>> The semantics of "(char *) &object" aren't guaranteed to produce anything
>> that can be safely dereferenced to access a char.
>> ... there may be "big endian" byte-addressed architectures where the
>> numeric address of a word is not the lowest-valued address of the bytes
>> within the word ...
>
>It occurs to me that X3J11 needs to add a guarantee that at least a cast
>to (void *) results in something representing the lowest-valued address
>of any byte in the object pointed at by whatever pointer is being
>converted; otherwise what good are the mem*() functions?
>
>Some of the issues raised by RMS are deeper than I at first realized.

I agree.  It seems odd, though, that (void *) and (char *) would behave
differently.  I know that there is a lot of existing code that assumes
that (char *) behaves this way.  This assumption is necessary because
(void *) doesn't exist, and bcopy (or memc?py) on (char *)&foo is too
useful.  Another example is writing data to a file -- how could one ever
write anything but a character string to a file?  For example,
to write an object
	short foo;
to a file, one must do something like
	fwrite((char *)&foo,sizeof(short),1,file)
at least in 4.2BSD.  In order to maintain this ability, it must be
possible to obtain the "numeric address of an object which is the
lowest-valued address of the bytes within the object."  One could make
(void *) this object, but that is still not correct -- fwrite needs to
dereference the pointer to the object to get characters, not "voids".
I suppose one could do (char *)(void *)(&foo), but that is ugly.
-- 
-------------
Brett Galloway
{pesnta,twg,ios,qubix,turtlevax,tymix,vecpyr,certes,isi}!wjvax!brett
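
The fwrite case can be sketched as a round trip.  This is an
illustration only, not code from the thread: the name short_round_trip
is hypothetical, and tmpfile() stands in for whatever file the program
would really use.  In the C library as later standardized, fwrite and
fread take void * arguments, so the (char *) cast shown for 4.2BSD is
not needed in this version.

```c
#include <assert.h>  /* only needed by callers that check with assert() */
#include <stdio.h>

/* Hypothetical helper (not from the thread): write a short to a
   temporary file as raw bytes, read it back, and return 1 if the
   value survives the trip, 0 on any failure or mismatch. */
static int short_round_trip(short value)
{
    short back = 0;
    int ok;
    FILE *fp = tmpfile();

    if (fp == NULL)
        return 0;
    ok = fwrite(&value, sizeof value, 1, fp) == 1;
    rewind(fp);
    ok = ok && fread(&back, sizeof back, 1, fp) == 1;
    fclose(fp);
    return ok && back == value;
}
```

The bytes written this way are only meaningful to a program that uses
the same size and byte order for short, so the file is not portable
between dissimilar machines.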

gwyn@brl-smoke.UUCP (01/24/87)

In article <812@wjvax.wjvax.UUCP> brett@wjvax.UUCP (Brett Galloway) writes:
>I agree.  It seems odd, though, that (void *) and (char *) would behave
>differently.  I know that there is a lot of existing code that assumes
>that (char *) behaves this way.  This assumption is necessary because
>(void *) doesn't exist, and bcopy (or memc?py) on (char *)&foo is too
>useful.  Another example is writing data to a file -- how could one ever
>write anything but a character string to a file?  For example,
>to write an object
>	short foo;
>to a file, one must do something like
>	fwrite((char *)&foo,sizeof(short),1,file)
>at least in 4.2BSD.  In order to maintain this ability, it must be
>possible to obtain the "numeric address of an object which is the
>lowest-valued address of the bytes within the object."  One could make
>(void *) this object, but that is still not correct -- fwrite needs to
>dereference the pointer to the object to get characters, not "voids".
>I suppose one could do (char *)(void *)(&foo), but that is ugly.

Two points:

Actually I agree with the gist of your comments.  I briefly checked
this with Larry Rosler (the X3J11 Redactor) at USENIX, and my impression
of the outcome of our discussion is that X3J11 certainly intends that
the conversion (char *) produces the address of the lowest-addressed char
of the original referenced object.  However, I wasn't able to find an
explicit guarantee of this in the draft proposed standard.  This seems
like an oversight that needs to be remedied.

You should also realize that I am a proponent of a modification to the
standard to support chars that are more than one "byte" (basic accessible
storage unit).  When I was drafting my proposal on this issue,
I observed the interesting phenomenon that all the void * parameters
in the draft referred to basic storage units and all the char *s were
used for actual character text, except for fread/fwrite which I think
need to be restricted to binary objects (basic storage units).  (I
would appreciate hearing of any significant use of these with textual
data.)  Since void * plays the role of a "lowest-common-denominator"
(i.e., generic) pointer, it is appropriate for it to have magical
properties, but if we establish the common idea of how chars (or other
byte-sized objects, in my scheme) are packed inside objects, there is
then no need for void * to be singled out specially in this regard.