[comp.unix.wizards] pointer alignment when int != char *

meissner@xyzzy.UUCP (Michael Meissner) (01/01/70)

In article <13676@topaz.rutgers.edu> ron@topaz.rutgers.edu (Ron Natalie) writes:
> More correctly stated: if the difference between two pointers is
> ever more than can be represented by a long, you are in trouble.
> 
> -Ron

Actually, to be more precise, if the difference between two pointers which
point to members of the SAME array is ever more than that which can be
represented by a long, you are in trouble.  All of the standards say that
pointer subtraction is only defined within an aggregate.  This allows putting
each top level item into a separate segment on say an 80*86, and only doing
the subtraction between the two offsets.  Many MSDOS compilers do this
already.
-- 
Michael Meissner, Data General.		Uucp: ...!mcnc!rti!xyzzy!meissner
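
The rule described above can be sketched in present-day C. This is only an illustration: `element_distance` and `demo_distance` are hypothetical names, and `ptrdiff_t` from `<stddef.h>` is assumed to be available (it arrived with ANSI C and is not something the post itself mentions).

```c
#include <stddef.h>

/* Hypothetical helper: distance in elements between two pointers that the
 * caller guarantees point into (or one past the end of) the SAME array.
 * For pointers into different objects the subtraction is undefined. */
ptrdiff_t element_distance(const long *from, const long *to)
{
    return to - from;           /* counts elements, not bytes */
}

/* Example use: both pointers refer to members of one array, so the
 * result is simply the difference of the two subscripts. */
ptrdiff_t demo_distance(void)
{
    long table[16];

    return element_distance(&table[3], &table[11]);
}
```

Because the result is only meaningful within one array, an implementation on a segmented machine is free to subtract the offsets alone, as the post says.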

simon@its63b.ed.ac.uk (Simon Brown) (06/22/87)

If I were to want to implement malloc (or some such) on a machine where
sizeof(int) != sizeof(char *), how do I ensure that the pointer-values I
return are maximally aligned (eg, quad-aligned)? If sizeof(int)==sizeof(char *),
then I can cast the pointer to an int, do whatever arithmetic stuff is
required to it to get it to be aligned, then cast it back again - but of
course this won't work if information is lost by either of the casts.

Any hints?

(BTW, It's not really malloc I'm dealing with, I just lied about that one)

	%{
	    Simon!
	%}


-- 
----------------------------------
| Simon Brown 		         | UUCP:  seismo!mcvax!ukc!its63b!simon
| Department of Computer Science | JANET: simon@uk.ac.ed.its63b
| University of Edinburgh,       | ARPA:  simon%its63b.ed.ac.uk@cs.ucl.ac.uk
| Scotland, UK.			 |
----------------------------------	 "Life's like that, you know"

blarson@castor.usc.edu (Bob Larson) (07/05/87)

In article <493@its63b.ed.ac.uk> simon@its63b.ed.ac.uk (Simon Brown) writes:
>If I were to want to implement malloc (or some such) on a machine where
>sizeof(int) != sizeof(char *), how do I ensure that the pointer-values I
>return are maximally aligned (eg, quad-aligned)?

The same way as you would on any other machine: non-portably.
(What is your definition of quad-aligned?  4 * sizeof(char)?  There
are quite a few machines where this is not maximally aligned.)

For example, prime 64v mode:

char *alignpointer(p)
char *p;
{
    union {
	char *up;
	struct {
	    unsigned fault:1;
	    unsigned ring:2;	/* I may have the ring and extend bits exchanged */
	    unsigned extend:1;	/* check before you try this on a real prime */
	    unsigned segment:12;
	    unsigned offset:16;
	    unsigned bit:4;
	    unsigned unused:12;
	} point;
    } un;
    int zerooffset;

    un.up = p;
    if(un.point.fault) return p; /* faulted pointer, not a valid address */
    zerooffset = un.point.offset == 0;
    un.point.offset = (un.point.offset | (un.point.extend & (un.point.bit!=0))
	 + 1) & ~1;	/* round offset up to next 4 byte boundary */
    un.point.extend = 0;/* and say it is at a 2-byte boundary */
    un.point.bit = 0;	/* unneeded, but leave it clean */
    if(un.point.offset == 0 && !zerooffset) un.point.segment++;
    return un.up;
}

Obviously, it would be easier to make sure to generate aligned pointers
in the first place.  Also I did not make all the assumptions that the C compiler
does, assuming you could have gotten the pointer via another language.
Bob Larson		Arpa: Blarson@Ecla.Usc.Edu
Uucp: {sdcrdcf,seismo!cit-vax}!oberon!castor!blarson
"How well do we use our freedom to choose the illusions we create?" -- Timbuk3

gwyn@brl-smoke.ARPA (Doug Gwyn ) (07/05/87)

In article <493@its63b.ed.ac.uk> simon@its63b.ed.ac.uk (Simon Brown) writes:
>If I were to want to implement malloc (or some such) on a machine where
>sizeof(int) != sizeof(char *), how do I ensure that the pointer-values I
>return are maximally aligned (eg, quad-aligned)? If sizeof(int)==sizeof(char *),
>then I can cast the pointer to an int, do whatever arithmetic stuff is
>required to it to get it to be aligned, then cast it back again - but of
>course this won't work if information is lost by either of the casts.

First, do most of your arithmetic on (char *) data types, not on (int)s.

Second, forcing alignment may require converting your pointers to
integral types to do the rounding operations.  (long) is appropriate
for portable code.  (If a (char *) won't fit into a (long), you have
real problems!)

Third, it is difficult to portably determine alignment requirements.
Consider using something like the following:
	struct align
	{
		char	c0;
		union
		{
			long	l1[2];
			double	d1[2];
			char	*cp1[2];
			union
			{
				long	l2[2];
				double	d2[2];
				char	*cp2[2];
			}	u1[2];
		}	u0;
	}	a;
	#define	ALIGN	((char *)&a.u0 - (char *)&a.c0)
(This example can probably be improved.)
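
One way the three points above might fit together is sketched below. It is a reduced variant of the struct trick, not a definitive implementation: `worst_case`, `align_probe`, `round_up`, and `align_within` are hypothetical names, the list of union members is an assumption about which types matter, and `offsetof` from `<stddef.h>` stands in for the pointer subtraction in the macro. The rounding is done on an offset from an already aligned base, so no pointer is ever converted to an integer.

```c
#include <assert.h>
#include <stddef.h>

/* The padding the compiler inserts after a single char reveals the
 * strictest alignment among the union's members (assumed list). */
union worst_case {
    long    l;
    double  d;
    char   *cp;
};

struct align_probe {
    char             c0;
    union worst_case u0;
};

#define ALIGN (offsetof(struct align_probe, u0))

/* Round an offset up to the next multiple of ALIGN using integer
 * arithmetic on the offset only, never on a pointer's bit pattern. */
size_t round_up(size_t offset)
{
    return (offset + ALIGN - 1) / ALIGN * ALIGN;
}

/* Return an aligned pointer at or after base + offset.  Assumes that
 * `base` itself is already suitably aligned, for example because it is
 * the address of a union worst_case object or came from malloc. */
char *align_within(char *base, size_t offset)
{
    assert(base != NULL);
    return base + round_up(offset);
}
```

An allocator built this way keeps a `(char *)` base and a running offset, which follows the advice to do the arithmetic on `(char *)` data rather than on `(int)`s.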

lm@cottage.WISC.EDU (Larry McVoy) (07/06/87)

In article <6061@brl-smoke.ARPA> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
>integral types to do the rounding operations.  (long) is appropriate
>for portable code.  (If a (char *) won't fit into a (long), you have
>real problems!)

I'm not sure this is true anymore.  Don't some supercomputers make
longs 32 bits, long longs 64 bits, and have addresses > 32 bits and < 64 bits?
I seem to remember that someone said something like that recently.

Larry McVoy 	        lm@cottage.wisc.edu  or  uwvax!mcvoy

fu@hc.DSPO.GOV (Castor L. Fu) (07/06/87)

In article <3812@spool.WISC.EDU> lm@cottage.WISC.EDU (Larry McVoy) writes:
>In article <6061@brl-smoke.ARPA> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
>>integral types to do the rounding operations.  (long) is appropriate
>>for portable code.  (If a (char *) won't fit into a (long), you have
>>real problems!)
>
>I'm not sure this is true anymore.  Don't some supercomputers make
>longs 32 bits, long longs 64 bits, and have addresses > 32 bits and < 64 bits?
>I seem to remember that someone said something like that recently.
>
>Larry McVoy 	        lm@cottage.wisc.edu  or  uwvax!mcvoy

Well, I am not positive about how the C compiler is organized, 
(who wants to use a compiler which can barely vectorize on a cray?)
However,  the FORTRAN compiler's primary data type for integers is
64 bits wide.  Internally, the addressing registers are only 24 bits
wide.  (The machine has no virtual memory, and 24 bits addresses 
16 megawords which is still 128Mbytes, so the need for 32 bit or 64 bit
addressing is questionable.)  Anyway, this has led to much grief for
me when I found library routines which never expected to see
things bigger than 8 Megawords (since the integers are signed).

So I guess the moral of the story is that
sizeof ( char *) < sizeof(int)  is also quite possible in some weird
implementations.

				-Castor Fu
				fu@hc.dspo.gov

gwyn@brl-smoke.ARPA (Doug Gwyn ) (07/06/87)

In article <3812@spool.WISC.EDU> lm@cottage.WISC.EDU (Larry McVoy) writes:
-In article <6061@brl-smoke.ARPA> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
->integral types to do the rounding operations.  (long) is appropriate
->for portable code.  (If a (char *) won't fit into a (long), you have
->real problems!)
-
-I'm not sure this is true anymore.  Don't some supercomputers make
-longs 32 bits, long longs 64 bits, and have addresses > 32 bits and < 64 bits?
-I seem to remember that someone said something like that recently.

What's a (long long)?  We were talking about portable code!

lm@cottage.WISC.EDU (Larry McVoy) (07/06/87)

I sez:
  I'm not sure this is true anymore.  Don't some supercomputers make
  longs 32 bits, long longs 64 bits, and have addresses > 32 bits and < 64 bits?
  I seem to remember that someone said something like that recently.

Doug sez:
  What's a (long long)?  We were talking about portable code!

A long long is a kludge.  However, I seem to remember that it went something
like this:  a company was doing unix on an Amdahl (???) and the unix people
were really used to (xxx *) == 32 bits and (long) == 32 bits, and having
it otherwise broke all sorts of code.  So they gave people short, int, long,
and long long.  Yeah, it's gross.  But so was defining C in such an 
ambiguous way.   It's really time for int8 int16 int32 int64 or some such
attempt at defining sizes with the type.

Larry McVoy 	        lm@cottage.wisc.edu  or  uwvax!mcvoy
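
For readers coming to this later: the C standard did eventually add names of this kind in `<stdint.h>` (C99, well after this thread). The sketch below only shows how such declarations read; `packet_header` and `BITS_OF` are hypothetical names, and the exact-width types are optional in the standard, so their presence is an assumption about the implementation.

```c
#include <limits.h>
#include <stdint.h>

/* The width in bits is part of the type name, independent of what the
 * implementation chose for short, int, and long. */
struct packet_header {
    uint8_t  version;
    uint16_t length;
    uint32_t sequence;
    uint64_t timestamp;
};

/* Bits in a type, computed rather than assumed. */
#define BITS_OF(type) (sizeof(type) * CHAR_BIT)
```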

karl@haddock.UUCP (Karl Heuer) (07/07/87)

In article <3812@spool.WISC.EDU> lm@cottage.WISC.EDU (Larry McVoy) writes:
>In article <6061@brl-smoke.ARPA> gwyn@brl.arpa (Doug Gwyn) writes:
>>(long) is appropriate for portable code.  (If a (char *) won't fit into a
>>(long), you have real problems!)

Hasn't ANSI removed all pretense of pointers being integerizable?

>I'm not sure this is true anymore.  Don't some supercomputers make
>longs 32 bits, long longs 64 bits, and have addresses > 32 bits and < 64 bits?
>I seem to remember that someone said something like that recently.

Probably my article, which was hypothetical.  I was less concerned with the
cast of pointer to int, which is nonportable anyway, than with the kosherness
of having size_t and ptrdiff_t be larger than unsigned long.

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

ron@topaz.rutgers.edu.UUCP (07/08/87)

That is hideous.  I don't know what supercomputer you are referring
to but Crays have ints and longs both at 64 bits.  There are no super-longs.
When we did the compilers for the HEP Supercomputer (64 bit words),
we opted for 16 bit shorts, 64 bit ints,  and 64 bit longs.  There is
one more hardware supported type (half words-32 bits).  Avoiding things
that would really warp the language such as short long ints or long short
ints, and realizing that we really wanted int to be 64 bits (the convenient
size as stated in K&R and the standards), we settled for a separate "hidden"
type that we try to avoid using except when necessary.  It was called
_int32, though the term "medium int" did come up in discussion.  By the
way, it was a real pain hacking pcc to do the extra int type.

-Ron

davidsen@steinmetz.steinmetz.UUCP (William E. Davidsen Jr) (07/10/87)

In article <13218@topaz.rutgers.edu> ron@topaz.rutgers.edu (Ron Natalie) writes:
: That is hideous.  I don't know what supercomputer you are referring
: to but Crays have ints and longs both at 64 bits.  There are no super-longs.
: When we did the compilers for the HEP Supercomputer (64 bit words),
: we opted for 16 bit shorts, 64 bit ints,  and 64 bit longs.  There is
: one more hardware supported type (half words-32 bits).  Avoiding things...

Why not have int be 32 bits? That fits the requirement that
length char<=short<=int<=long. Not a comment, just a question...
-- 
	bill davidsen		(wedu@ge-crd.arpa)
  {chinet | philabs | sesimo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

gwyn@brl-smoke.ARPA (Doug Gwyn ) (07/12/87)

In article <6655@steinmetz.steinmetz.UUCP> davidsen@kbsvax.steinmetz.UUCP (William E. Davidsen Jr) writes:
>Why not have int be 32 bits? That fits the requirement that
>length char<=short<=int<=long. Not a comment, just a question...

There are two main considerations for the correct size to be used for (int)
when implementing C on a new system:

1.  (int) objects should be accessible quickly.  On a word-addressed
architecture, this argues for making them full words.

2.  (int)s must be usable for indexing arrays.  Depending on the address
space, one may have to either impose an artificial limit on array sizes
or else make (int)s longer than they might have been.  For example, on
a hypothetical PDP-11AX (which doesn't exist because it turned into a VAX),
one could have had 16 bits continue to be the natural integer data size but
24 or 32 bits could have been the preferred pointer size due to an extended
addressing scheme using base registers a la Gould.  The C implementor would
almost certainly have wanted to make the larger address space available on
such a machine, which would force some sort of accommodation to be made for
indexing char arrays -- probably by making (int)s as wide as char pointers.

I understand from hearsay that the IBM PC world (actually the Intel 8086
world) ran against this very problem, and instead of making a single sane
choice they ended up proliferating a variety of incompatible sets of
choices (hilariously called "models").  One hopes that a lesson was learned,
but I doubt it.

ron@topaz.rutgers.edu (Ron Natalie) (07/13/87)

> : When we did the compilers for the HEP Supercomputer (64 bit words),
> : we opted for 16 bit shorts, 64 bit ints,  and 64 bit longs.  There is
> : one more hardware supported type (half words-32 bits).  Avoiding things...

> Why not have int be 32 bits? That fits the requirement that
> length char<=short<=int<=long. Not a comment, just a question...

Because "int" is supposed to be a convenient size.  The convenient size for
us is 64 bits.  Since the largest number of variables are type "int" you
want to use something pretty efficient (like the word size).

By the way, your assumption that type "char" has some guaranteed relationship
to any of the integer types is wrong, although anyone who has "char"s that
aren't exactly eight bits is likely to cause many applications to die.

-Ron

davidsen@steinmetz.steinmetz.UUCP (William E. Davidsen Jr) (07/15/87)

In article <6110@brl-smoke.ARPA> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
>In article <6655@steinmetz.steinmetz.UUCP> davidsen@kbsvax.steinmetz.UUCP (William E. Davidsen Jr) writes:
>>Why not have int be 32 bits? That fits the requirement that
>>length char<=short<=int<=long. Not a comment, just a question...
>
>There are two main considerations for the correct size to be used for (int)
>when implementing C on a new system:
>
>1.  (int) objects should be accessible quickly.  On a word-addressed
>architecture, this argues for making them full words.

[ I thought you mentioned that the 32 bit size was hardware supported.
On many machines the short math is faster than long (e.g. vax, 68000). ]
>
>2.  (int)s must be usable for indexing arrays.  Depending on the address
>space, one may have to either impose an artificial limit on array sizes
>or else make (int)s longer than they might have been.  For example, on

[ The 32 bit size allows an acceptable range as a subscript.  Although at
some point 4GB won't be enough, most of the problems using big memory
also use multiple arrays smaller than 2GB. ]

>I understand from hearsay that the IBM PC world (actually the Intel 8086
>world) ran against this very problem, and instead of making a single sane
>choice they ended up proliferating a variety of incompatible sets of
>choices (hilariously called "models").  One hopes that a lesson was learned,
>but I doubt it.

That's the point I was making in my posting... the problems occur when
the int won't hold an address, and then mainly because some <deleted> is
playing fast & loose with bit fiddling in pointers or some such. The
major problems with "models" would go away if someone made the large
model int the same length as the large model pointer.

I've been fighting with this in pathalias, trying to get it to run on an
80*86 machine, and finding that (a) it does all its own memory
allocation, and (b) it uses ints to hold addresses while doing it. This
kind of non-portable code will fail on machines which are not byte
addressed, and which use a pointer which looks like a word address and
character offset.

X3J11 covered this very well: pointers are not forced to be the size of
int; they are not even required to be the same size as each other! Code written for large
model 80*86 will almost always run on any other machine, assuming that
it doesn't use calls to the hardware, etc.
-- 
	bill davidsen		(wedu@ge-crd.arpa)
  {chinet | philabs | sesimo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

throopw@xyzzy.UUCP (Wayne A. Throop) (07/28/87)

> karl@haddock.UUCP (Karl Heuer)
>> lm@cottage.WISC.EDU (Larry McVoy)
>>> gwyn@brl.arpa (Doug Gwyn)
>>>(long) is appropriate for portable code.  (If a (char *) won't fit into a
>>>(long), you have real problems!)

I am aware of a seriously developed architecture where "long" was 64
bits, and pointers were 128 bits.  That is, arithmetic could be
performed on binary integers up to 64 bits long by the CPU, but pointers
had considerable extra information beyond offset information.  In
particular, there was a universal, shared, access-protected, segmented
address space.  It would have been natural to make shorts either 16 bits
or 32 bits, ints 32 bits, and longs 64 bits, which is quite vanilla.
The odd thing would have been that pointers wouldn't fit into any of
those.  But all in all, a very lovely machine.

And yes, much C code would have been hard to port to this machine, or
the compiler would have had to stand on its head and spin about 48
hula-hoops on its toes to make the usual assumptions that many C
programmers make about the underlying hardware seem to be true.  Sadly,
it is unlikely that this architecture will haunt C implementors or
programmers.  The current fashion in computer architecture has moved
away from many of the concepts it embodied.  Sigh.

>> Don't some supercomputers make longs 32 bits, long longs 64 bits, and
>> have addresses > 32 bits and < 64 bits?
>>I seem to remember that someone said something like that recently.
> Probably my article, which was hypothetical.  I was less concerned with the
> cast of pointer to int, which is nonportable anyway, than with the kosherness
> of having size_t and ptrdiff_t be larger than unsigned long.

Ah.  The architecture I had in mind does not have these problems.  Of
course, many C programmers assume that any two non-null pointers of the
same type can be subtracted, which isn't the case for this architecture.

--
What!!??  What is it!!??  
Did they find Jimmy Hoffa under Tammy Bakker's makeup?
                                --- from Bloom County
-- 
Wayne Throop      <the-known-world>!mcnc!rti!xyzzy!throopw

mark@applix.UUCP (Mark Fox) (07/29/87)

In article <161@xyzzy.UUCP> throopw@xyzzy.UUCP (Wayne A. Throop) writes:
%
%I am aware of a seriously developed architecture where "long" was 64
%bits, and pointers were 128 bits... In
%particular, there was a universal, shared, access-protected, segmented
%address space... But all in all, a very lovely machine... Sadly,
%it is unlikely that this architecture will haunt C implementors or
%programmers.  The current fashion in computer architecture has moved
%away from many of the concepts it embodied.  Sigh.
>-- 
>Wayne Throop      <the-known-world>!mcnc!rti!xyzzy!throopw

Ahh, DG's unforgettable FHP machine. What a dream that was. :-)

-- 
                                    Mark Fox
       Applix Inc., 112 Turnpike Road, Westboro, MA 01581, (617) 870-0300
                    uucp:  seismo!harvard!m2c!applix!mark

gwyn@brl-smoke.ARPA (Doug Gwyn ) (07/30/87)

In article <161@xyzzy.UUCP> throopw@xyzzy.UUCP (Wayne A. Throop) writes:
->>> gwyn@brl.arpa (Doug Gwyn)
->>>(long) is appropriate for portable code.  (If a (char *) won't fit into a
->>>(long), you have real problems!)
-I am aware of a seriously developed architecture where "long" was 64
-bits, and pointers were 128 bits.

Yup, you notice the dpANS for C doesn't guarantee that there will be an
integral type able to hold a pointer without loss of information.  It
does give rules for such a feature if it happens to be implemented, however.
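
A sketch of how a program might cope with that lack of a guarantee, using a facility from a later standard rather than the dpANS under discussion: C99's `<stdint.h>` offers `uintptr_t` as an optional type and defines `UINTPTR_MAX` only when the type exists. `pointer_round_trips` is a hypothetical name.

```c
#include <stdint.h>

/* Returns 1 if p survives a round trip through an integer type,
 * 0 if it does not, and -1 if this implementation offers no integer
 * type specified to hold a pointer at all. */
int pointer_round_trips(void *p)
{
#ifdef UINTPTR_MAX
    uintptr_t bits = (uintptr_t)p;  /* the value is implementation-defined */

    return (void *)bits == p;
#else
    (void)p;                        /* no suitable integer type here */
    return -1;
#endif
}
```

Code that must run on a machine like the 128-bit-pointer architecture mentioned earlier in the thread would take the `-1` branch and have to avoid the conversion altogether.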

ron@topaz.rutgers.edu (Ron Natalie) (08/04/87)

More correctly stated: if the difference between two pointers is
ever more than can be represented by a long, you are in trouble.

-Ron

dhesi@bsu-cs.UUCP (Rahul Dhesi) (08/07/87)

In article <179@xyzzy.UUCP> meissner@nightmare.UUCP (Michael Meissner) writes:
>All of the standards say that
>pointer subtraction is only defined within an aggregate.  This allows putting
>each top level item into a separate segment on say an 80*86, and only doing
>the subtraction between the two offsets.  Many MSDOS compilers do this
>already.

To subtract two independent large-model pointers of the type
segment:offset, I tried this:

     (unsigned long) p2 - (unsigned long) p1

I was hoping that the cast to unsigned long would convert each pointer
to a sort of absolute memory address in bytes, and the subtraction
would yield the difference in bytes.  Under Borland's Turbo C at least,
such a cast is a no-op, so the resulting unsigned long does not
necessarily increase monotonically with increasing memory address to
which the original pointer points.

I understand that the requirement on such casts is that they be
unsurprising and reversible, to the extent that these are possible.  It
would be nice if "unsurprising" were interpreted to mean that the
subtraction I was attempting would work.  The only catch is that
reversibility would be weakened because in the 8086 architecture many
different long pointers can point to the same address, but I could live
with that.
-- 
Rahul Dhesi         UUCP:  {ihnp4,seismo}!{iuvax,pur-ee}!bsu-cs!dhesi
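
The mismatch described above can be modelled with plain integers. This is a sketch under stated assumptions, not any compiler's actual far-pointer layout: `far_ptr`, `make_far`, `linear_address`, `raw_bits`, and `linear_difference` are hypothetical names, the fixed-width types come from the later `<stdint.h>`, and wraparound above 1 MB is ignored.

```c
#include <stdint.h>

/* A real-mode 8086 segment:offset pair modelled as two 16-bit integers. */
struct far_ptr {
    uint16_t segment;
    uint16_t offset;
};

struct far_ptr make_far(uint16_t segment, uint16_t offset)
{
    struct far_ptr p;

    p.segment = segment;
    p.offset = offset;
    return p;
}

/* Physical address as the 8086 forms it: segment * 16 + offset. */
uint32_t linear_address(struct far_ptr p)
{
    return ((uint32_t)p.segment << 4) + p.offset;
}

/* Raw 32-bit image with the segment in the high half, roughly what a
 * no-op cast to unsigned long would yield. */
uint32_t raw_bits(struct far_ptr p)
{
    return ((uint32_t)p.segment << 16) | p.offset;
}

/* Byte distance from a to b measured by physical address. */
int32_t linear_difference(struct far_ptr a, struct far_ptr b)
{
    return (int32_t)linear_address(b) - (int32_t)linear_address(a);
}
```

With these definitions, 1000:0010 and 1001:0000 (hex) name the same byte yet have different raw images, and 1001:0000 has a larger raw image than 1000:0020 while lying 16 bytes below it, which is the non-monotonic behaviour the post reports.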

meissner@xyzzy.UUCP (Michael Meissner) (08/26/87)

In article <934@bsu-cs.UUCP> dhesi@bsu-cs.UUCP (Rahul Dhesi) writes:
# 
# To subtract two independent large-model pointers of the type
# segment:offset, I tried this:
# 
#      (unsigned long) p2 - (unsigned long) p1
# 
# I was hoping that the cast to unsigned long would convert each pointer
# to a sort of absolute memory address in bytes, and the subtraction
# would yield the difference in bytes.  Under Borland's Turbo C at least,
# such a cast is a no-op, so the resulting unsigned long does not
# necessarily increase monotonically with increasing memory address to
# which the original pointer points.

This is bad practice.  I know of machines that have different formats for
pointers to words and pointers to bytes, and other machines that use things
like bit pointers.  In none of these cases, nor on a segmented machine like the
80*86, will subtraction give you what you want.  This is yet another symptom
of the "all the world's a VAX" syndrome.
-- 
Michael Meissner, Data General.		Uucp: ...!mcnc!rti!xyzzy!meissner
					Arpa/Csnet:  meissner@dg-rtp.DG.COM

ed@mtxinu.UUCP (Ed Gould) (08/28/87)

># To subtract two independent large-model pointers of the type
># segment:offset, I tried this:
># 
>#      (unsigned long) p2 - (unsigned long) p1
># 
>
>This is bad practice.

It's also not legal in the proposed ANSI C standard.  Pointers
may be subtracted *only* if they point to members of the same
array of elements.  Casting them has no real effect on a byte-
addressed machine; it's not at all obvious what it should do
on other machines.

-- 
Ed Gould                    mt Xinu, 2560 Ninth St., Berkeley, CA  94710  USA
{ucbvax,decvax}!mtxinu!ed   +1 415 644 0146

"A man of quality is not threatened by a woman of equality."

randy@umn-cs.UUCP (Randy Orrison) (08/28/87)

In article <483@mtxinu.UUCP> ed@mtxinu.UUCP (Ed Gould) writes:
>It's also not legal in the proposed ANSI C standard.  Pointers
>may be subtracted *only* if they point to members of the same
>array of elements.

How is this determined?  example:

int
strlen(s)
char *s;
{
	register char	*c;

	c = s;
	while(*c++)
		;
	return (c-s);
}

How does anything know if s & c are pointing to members of the same array?
If s isn't 0 terminated, c could end up anywhere...

(No flames on off-by-one errors, or any design issues.  this is just an
example)

	-randy
-- 
Randy Orrison, University of Minnesota School of Mathematics
UUCP:	{ihnp4, seismo!rutgers!umnd-cs, sun}!umn-cs!randy
ARPA:	randy@ux.acss.umn.edu		 (Yes, these are three
BITNET:	randy@umnacvx			 different machines)

jc@minya.UUCP (John Chambers) (08/29/87)

In article <483@mtxinu.UUCP>, ed@mtxinu.UUCP (Ed Gould) writes:
> ># To subtract two independent large-model pointers of the type
> ># segment:offset, I tried this:
> ># 
> >#      (unsigned long) p2 - (unsigned long) p1
> ># 
> >
> >This is bad practice.
> 
> It's also not legal in the proposed ANSI C standard.  Pointers
> may be subtracted *only* if they point to members of the same
> array of elements.  

Huh?  This example isn't subtracting pointers to anything.  It is 
subtracting two unsigned longs.  I sure hope that's defined.

I also hope that the ANSI standards haven't done THAT much damage
to C semantics!

(:-)

-- 
	John Chambers <{adelie,ima,maynard}!minya!{jc,root}> (617/484-6393)

guy%gorodish@Sun.COM (Guy Harris) (08/29/87)

> How does anything know if s & c are pointing to members of the same array?
> If s isn't 0 terminated, c could end up anywhere...

If "s" isn't 0 terminated, the result returned from "strlen" isn't meaningful
anyway!  As such, the fact that "c" might not be in the same array is hardly
relevant.

The rules don't say that the implementation MUST detect whether the two
pointers belong to the same array, and slap your wrists if they aren't; they
say that the behavior is *undefined* if the pointers aren't members of the same
array!  As such, nobody *has* to know if "s" and "c" are pointing to members of
the same array.  In any *valid* call to "strlen", the pointers will be members
of the same array:

	1) It could be a string constant, which is an array;

	2) It could be an object declared as an array;

	3) It could be an array allocated by "malloc", or an array that is a
	   component of an object allocated by "malloc".

In all *these* cases, if the array contains a valid string, the call to
"strlen" must return a meaningful result, and your sample code for "strlen"
will subtract two pointers that point to members of the same array, or a
pointer that points to a member of an array from a pointer one past the end of
that array, both of which are valid.
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com
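
A variant of the quoted strlen sketch that dereferences the pointer in the loop test and stops on the terminator is shown below as an illustration of the argument above. `my_strlen` is a hypothetical name chosen to avoid clashing with the library routine, and `size_t` is assumed from `<stddef.h>`.

```c
#include <stddef.h>

/* Every pointer value formed here points at an element of the caller's
 * array, so the final subtraction is well defined whenever s is a valid
 * string.  If s is not terminated, the walk itself is already invalid
 * before any subtraction takes place. */
size_t my_strlen(const char *s)
{
    const char *c = s;

    while (*c != '\0')
        c++;                    /* stops ON the terminator */
    return (size_t)(c - s);     /* both pointers in the same array */
}
```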

gwyn@brl-smoke.ARPA (Doug Gwyn ) (08/29/87)

In article <2130@umn-cs.UUCP>, randy@umn-cs.UUCP (Randy Orrison) writes:
> In article <483@mtxinu.UUCP> ed@mtxinu.UUCP (Ed Gould) writes:
> >Pointers may be subtracted *only* if they point to members of the same
> >array of elements.
> How is this determined?  example:
> strlen(s)
> 	return (c-s);

Obviously all characters in a string are in the same object (be it
(char []) or a chunk of malloc()-allocated storage).  I don't recall if
the latter is covered by the draft proposed standard but it should be.

If some code violates the same-aggregate pointer constraint, the
behavior is unspecified.  It might work or it might not.  No portable
program should violate the constraint.

cik@l.cc.purdue.edu (Herman Rubin) (08/29/87)

In article <483@mtxinu.UUCP>, ed@mtxinu.UUCP (Ed Gould) writes:
 
> It's also not legal in the proposed ANSI C standard.  Pointers
> may be subtracted *only* if they point to members of the same
> array of elements.  

The fact that some `gurus' cannot see the uses of this construct, as well
as others such as goto's, forcing inline, etc., is no more appropriate
than prohibiting the use of any tools developed since 1800 to sculptors.
You have no way of knowing how I can use the power of the machine; I may
very well find a new way of doing some things tomorrow that I do not see
today.  Let us remove unnecessary restrictions from the languages.

-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet

barmar@think.COM (Barry Margolin) (08/30/87)

In article <572@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>In article <483@mtxinu.UUCP>, ed@mtxinu.UUCP (Ed Gould) writes:
> 
>> It's also not legal in the proposed ANSI C standard.  Pointers
>> may be subtracted *only* if they point to members of the same
>> array of elements.  
>
>The fact that some `gurus' cannot see the uses of this construct, as well
>as others such as goto's, forcing inline, etc., is no more appropriate
>than prohibiting the use of any tools developed since 1800 to sculptors.
>You have no way of knowing how I can use the power of the machine; I may
>very well find a new way of doing some things tomorrow that I do not see
>today.  Let us remove unnecessary restrictions from the languages.

This is not an unnecessary restriction.  It is there because the
construct is non-portable, and the purpose of the C standard (indeed,
ANY language standard) is to define a language in which portable
programs may be written.  No one is prohibiting you from subtracting
pointers to your heart's delight on machines where it makes sense;
just be aware that the standard doesn't specify what the result will
be, so your program may behave differently on different architectures.
In fact, I know of an architecture where it may return different
results for pointers to the same two objects at different times: the
Symbolics Lisp Machine.  It has a garbage collector that moves objects
around in memory, so the addresses may change, and therefore the
difference may change.  (Note: I've never used their C compiler, so I
don't know whether it will do this; however, I also believe that their
architecture allows them to detect comparisons of pointers to
different arrays).


---
Barry Margolin
Thinking Machines Corp.

barmar@think.com
seismo!think!barmar

allbery@ncoast.UUCP (08/30/87)

As quoted from <2130@umn-cs.UUCP> by randy@umn-cs.UUCP (Randy Orrison):
+---------------
| In article <483@mtxinu.UUCP> ed@mtxinu.UUCP (Ed Gould) writes:
| >It's also not legal in the proposed ANSI C standard.  Pointers
| >may be subtracted *only* if they point to members of the same
| >array of elements.
| 
| How is this determined?  example:  [deleted.  ++bsa]
| How does anything know if s & c are pointing to members of the same array?
| If s isn't 0 terminated, c could end up anywhere...
+---------------

I think that they mean that the result is only defined if the pointers are
pointing to members of the same structure; in any other situation, you may
get a number result but it may not have any meaning.
-- 
	    Brandon S. Allbery, moderator of comp.sources.misc
  {{harvard,mit-eddie}!necntc,well!hoptoad,sun!mandrill!hal}!ncoast!allbery
ARPA: necntc!ncoast!allbery@harvard.harvard.edu  Fido: 157/502  MCI: BALLBERY
   <<ncoast Public Access UNIX: +1 216 781 6201 24hrs. 300/1200/2400 baud>>
** Site "cwruecmp" has changed its name to "mandrill".  Please re-address **
*** all mail to ncoast to pass through "mandrill" instead of "cwruecmp". ***

allbery@ncoast.UUCP (08/30/87)

As quoted from <572@l.cc.purdue.edu> by cik@l.cc.purdue.edu (Herman Rubin):
+---------------
| In article <483@mtxinu.UUCP>, ed@mtxinu.UUCP (Ed Gould) writes:
|  
| > It's also not legal in the proposed ANSI C standard.  Pointers
| > may be subtracted *only* if they point to members of the same
| > array of elements.  
| 
| The fact that some `gurus' cannot see the uses of this construct, as well
| as others such as goto's, forcing inline, etc., is no more appropriate
| than prohibiting the use of any tools developed since 1800 to sculptors.
+---------------

Sure -- but, while your program may work fine on a Vax or a Sun, will it work
on a Cray-1?  LLNL's S-1?  The ANSI C standard defines *portable* code; you
can code something that works on your machine but doesn't conform, but don't
expect it to work on every machine.  (Example:  the difference between two
pointers not both associated with the same array may be meaningless on a
tagged architecture, and may result in either a garbage result or a memory
fault.)

Subtracting pointers is a different kind of restriction from the use of "goto";
the latter is a *stylistic* restriction, the former is a *portability*
restriction.
-- 
	    Brandon S. Allbery, moderator of comp.sources.misc
  {{harvard,mit-eddie}!necntc,well!hoptoad,sun!mandrill!hal}!ncoast!allbery
ARPA: necntc!ncoast!allbery@harvard.harvard.edu  Fido: 157/502  MCI: BALLBERY
   <<ncoast Public Access UNIX: +1 216 781 6201 24hrs. 300/1200/2400 baud>>
** Site "cwruecmp" has changed its name to "mandrill".  Please re-address **
*** all mail to ncoast to pass through "mandrill" instead of "cwruecmp". ***

root@hobbes.UUCP (08/31/87)

+---- Herman Rubin writes in <572@l.cc.purdue.edu> ----
| +---- Ed Gould writes ----
| | It's also not legal in the proposed ANSI C standard.  Pointers
| | may be subtracted *only* if they point to members of the same
| | array of elements.  
| +----
| You have no way of knowing how I can use the power of the machine; I may
| very well find a new way of doing some things tomorrow that I do not see
| today.  Let us remove unnecessary restrictions from the languages.
+----

*** The following is only valid on intel 808x architecture machines ***
	    Followups are directed to comp.sys.intel

On the intel chips (and I'm sure on many others) some compilers' malloc()
routines align memory requests on 16 byte boundaries.

So, if you did:				You might get: _________
	char *p1, *p2, *p3;			      /________/|
	p1 = malloc(20);			p1 -->|20 bytes||
	p2 = malloc(20);			      +--------+/
	p3 = p2 - p1;				       _________
						      /________/|
					       filler |? bytes ||
						      +--------+/
						       _________
						      /________/|
						p2 -->|20 bytes||
						      +--------+/

and p2 - p1 would NOT give you a useful number!  THAT is why ANSI said that the
result was undefined.  Not illegal, just undefined.  This means that compiler
writers can do stuff like this without having to worry about breaking code.

Iff you know what your compiler does AND iff you don't care about portability
then you can use the info like this:

printf("On this machine there are %lu bytes of filler between p1 and p2\n",
	(unsigned long)p2 - (unsigned long)p1 - 20);

or somesuch. ( This code WILL NOT WORK on intel chips.  See below)



-- New Subject:  pointer manipulation on intel chips --

Note:  This DOES NOT pertain to the usual "*(a+3)" or "if (p1 == p2)" stuff
which is called "pointer arithmetic" or "pointer manipulation" in languages
like C.  It instead refers to "dissecting" the value of "&foobar".  This comes
in when you wish to do things like the p3 = p2 - p1; above where p1 and p2
point to different aggregates.  The C compiler already takes care of the first
cases for you.


If you wish to do pointer manipulation on the intel 808x chips you need to
recognize how a pointer is constructed:

	A pointer has 2 parts, a SEGMENT and an OFFSET, each 16 bits in length.
    e.g.:  1040:3333
	SEGMENT:OFFSET
    
    In the "small" model, the SEGMENT is an unchanging value stored in a 
    register and the OFFSET is what is used as a "pointer" in C.

    In the "large" model, a pointer consists of a 32 bit structure which
    contains two 16 bit values, the SEGMENT and the OFFSET.

    The SEGMENT and the OFFSET are combined to make a 20 bit address like this:

    SEGMENT	[0001|0000|0100|0000]			0x1040
    OFFSET	     [0011|0011|0011|0011]		0x3333
	    --------------------------
    ADDRESS	[0001|0011|0111|0011|0011]	1040:3333 or 1000:3733 or
							     1001:3723 or
		Note: a pointer may have many 		     1002:3713 or
		values and still point to the same thing!       ...    or
							     1373:0003
    
    To convert the pointer 1040:3333 to an unsigned long address we use the
    formula (SEGMENT * 16) + OFFSET to get:

	    (0x1040  * 16) + 0x3333 = 0x00013733
    
    Note: even though a pointer may have many values, it has only ONE address!

    On the 808x chips this is a physical ADDRESS, but NOT a valid POINTER.
    Note that in this discussion, pointers are not addresses and
    addresses are not pointers!

    Two addresses may be subtracted to obtain a valid number which is the
    absolute difference (in bytes) of their physical locations.

    An address may be converted into a normalized pointer by constructing a
    SEGMENT:OFFSET pair where the lower 12 bits of the SEGMENT are ZERO.
	segment = (unsigned short)((address & 0x000F0000) / 16);
	offset  = (unsigned short)(address & 0x0000FFFF);

    Only pointers which A) are normalized, or B) have the same SEGMENT value
    can be validly compared for equality.  All addresses can be validly
    compared for equality.
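
The arithmetic above can be sketched in portable C.  This is a minimal model
only: "struct far_ptr" and the three function names are made up for
illustration and are not any real compiler's far-pointer type.  The
normalization follows the convention described above (lower 12 bits of the
SEGMENT are zero) and assumes the address fits in 20 bits.

```c
#include <stdint.h>

/* Hypothetical model of a large-model pointer: two 16-bit halves. */
struct far_ptr {
    uint16_t segment;
    uint16_t offset;
};

/* Physical address: (SEGMENT * 16) + OFFSET. */
uint32_t far_to_address(struct far_ptr p)
{
    return (uint32_t)p.segment * 16 + p.offset;
}

/* Normalized pointer for a 20-bit address: the top 4 address bits go
   into the top 4 bits of SEGMENT, the low 16 bits into OFFSET. */
struct far_ptr address_to_far(uint32_t address)
{
    struct far_ptr p;

    p.segment = (uint16_t)((address & 0x000F0000UL) / 16);
    p.offset  = (uint16_t)(address & 0x0000FFFFUL);
    return p;
}

/* Two pointers name the same byte iff their addresses are equal,
   whatever their SEGMENT:OFFSET spelling. */
int far_same_byte(struct far_ptr a, struct far_ptr b)
{
    return far_to_address(a) == far_to_address(b);
}
```

With these, 1040:3333 and 1373:0003 both map to address 0x13733, and
normalizing that address gives 1000:3733.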

Intel bashing flames should go to /dev/null; glaring errors should be emailed;
minor errors should be ignored.
-- 
John Plocher uwvax!geowhiz!uwspan!plocher  plocher%uwspan.UUCP@uwvax.CS.WISC.EDU

dave@murphy.UUCP (Dave Cornutt) (08/31/87)

In article <7939@think.UUCP>, barmar@think.COM (Barry Margolin) writes:
> In article <572@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
> >In article <483@mtxinu.UUCP>, ed@mtxinu.UUCP (Ed Gould) writes:
> > 
> >> It's also not legal in the proposed ANSI C standard.  Pointers
> >> may be subtracted *only* if they point to members of the same
> >> array of elements.  
> >
> 
> This is not an unnecessary restriction.  It is there because the
> construct is non-portable, and the purpose of the C standard (indeed,
> ANY language standard) is to define a language in which portable
> programs may be written.

I'll agree that the construct posted, (long) p1 - (long) p2, is nonportable
and should be flagged as such.  However, I will say this, because I don't
think the standard has really addressed it: there should be a way to take
any pointer and generate a byte offset from byte 0 in whatever address
space the code is running in.  The reason is that you need such a beast
to feed to lseek if you want to access something through one of the
/dev/mem devices (or maybe /proc).

> In fact, I know of an architecture where it may return different
> results for pointers to the same two objects at different times: the
> Symbolics Lisp Machine.  It has a garbage collector that moves objects
> around in memory, so the addresses may change, and therefore the
> difference may change.

I must be missing something here.  Admittedly, I don't know anything about
this machine, but it looks like this garbage collection would make pointers
useless, since there is no guarantee that, when you dereference a pointer,
the object that you're referring to will be in the same place that it was
when you obtained the address.  I can see how it could be done using some
sort of highly segmented memory, but it seems like the overhead would be
enormous (i.e., the iAPX 432).  How does this work?
---
"I dare you to play this record" -- Ebn-Ozn

Dave Cornutt, Gould Computer Systems, Ft. Lauderdale, FL
[Ignore header, mail to these addresses]
UUCP:  ...!{sun,pur-ee,brl-bmd,seismo,bcopen,rb-dc1}!gould!dcornutt
 or ...!{ucf-cs,allegra,codas,hcx1}!novavax!gould!dcornutt
ARPA: dcornutt@gswd-vms.arpa

"The opinions expressed herein are not necessarily those of my employer,
not necessarily mine, and probably not necessary."

guy@gorodish.UUCP (08/31/87)

> However, I will say this, because I don't think the standard has really
> addressed it:  there should be a way to take any pointer and generate a byte
> offset from byte 0 in whatever address space the code is running in.  The
> reason is that you need such a beast to feed to lseek if you want to access
> something through one of the /dev/mem devices (or maybe /proc).

The standard should NOT address this.  The standard mentions neither "lseek"
nor "/dev/mem" nor "/proc".  This sort of thing is rather non-portable, and is
as such completely outside the scope of the standard.  Since getting at some
other address space must be done in a different fashion on different
implementations, it is perfectly OK to require that getting the location in
that other address space also be done in a different fashion on different
implementations.
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com

gwyn@brl-smoke.ARPA (Doug Gwyn ) (09/01/87)

In article <588@murphy.UUCP> dave@murphy.UUCP (Dave Cornutt) writes:
>...  However, I will say this, because I don't
>think the standard has really addressed it: there should be a way to take
>any pointer and generate a byte offset from byte 0 in whatever address
>space the code is running in.  The reason is that you need such a beast
>to feed to lseek if you want to access something through one of the
>/dev/mem devices (or maybe /proc).

The ANSI C standard doesn't address this (pun intended?) because
the process may have incommensurable multiple data address spaces.
It cannot dictate the mapping to be used by UNIX /dev/*mem and
similar facilities; that's not within the scope of the C standard,
which also has to apply to non-UNIX-like environments.  It is up
to the operating system implementation to make things like that
work; it has nothing to do with the C language.

throopw@xyzzy.UUCP (09/01/87)

> dave@murphy.UUCP (Dave Cornutt)
> However, I will say this, because I don't
> think the standard has really addressed it: there should be a way to take
> any pointer and generate a byte offset from byte 0 in whatever address
> space the code is running in.  The reason is that you need such a beast
> to feed to lseek if you want to access something through one of the
> /dev/mem devices (or maybe /proc).

By "the standard", I presume draft X3J11 is meant.  First, the C
language standard had better say nothing that requires there to even
*BE* a single, linear address space in which "code is running".  There
are many machines where this isn't a well-founded presumption.  Thus the
whole idea of a process-unique "byte 0", or "a byte offset" from there
may not be present in the hardware for which the C source is being
compiled.  To say nothing of whether a C language standard should be
talking about "lseek" and "/dev/mem" in its rationale for a general
feature.

In fact, "the standard" says just about what it ought to say.  It gives
license to developers for whom it is natural to supply "byte offsets
from byte 0" to supply them, but does not require it from those
developers for whom it is an impossibility.

--
1+1=3, for sufficiently large values of 1.
-- 
Wayne Throop      <the-known-world>!mcnc!rti!xyzzy!throopw

barmar@think.UUCP (09/02/87)

In article <588@murphy.UUCP> dave@murphy.UUCP (Dave Cornutt) writes:
>> In fact, I know of an architecture where it may return different
>> results for pointers to the same two objects at different times: the
>> Symbolics Lisp Machine.  It has a garbage collector that moves objects
>> around in memory, so the addresses may change, and therefore the
>> difference may change.
>
>I must be missing something here.  Admittedly, I don't know anything about
>this machine, but it looks like this garbage collection would make pointers
>useless, since there is no guarantee that, when you dereference a pointer,
>the object that you're referring to will be in the same place that it was
>when you obtained the address.  I can see how it could be done using some
>sort of highly segmented memory, but it seems like the overhead would be
>enormous (i.e., the iAPX 432).  How does this work?

Whenever the garbage collector moves something, it effectively updates
all pointers to the object.  Most Lisp garbage collectors are of this
relocating variety these days, as it also tends to shrink the working
set and increase locality.  A particular pointer variable will always
point to the same object (until it is reassigned, of course), although
its internal numerical value may change.
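
As a toy illustration of "updates all pointers", here is a sketch in C.  It
is not how the Lisp Machine collector is implemented; the "roots" table and
the function name are assumptions made for the example, standing in for
whatever bookkeeping lets a real collector find every reference.

```c
#include <stddef.h>

struct object { int value; };

/* Hypothetical root set: every pointer the collector knows about. */
#define NROOTS 2
struct object *roots[NROOTS];

/* Copy *old to new_home, then redirect every root that referred to
   the old location.  The object is the same; only its address moved. */
void relocate(struct object *old, struct object *new_home)
{
    size_t i;

    *new_home = *old;
    for (i = 0; i < NROOTS; i++)
        if (roots[i] == old)
            roots[i] = new_home;
}
```

After relocate(), dereferencing a root still yields the same object, but the
numeric difference between two roots may no longer be what it was before.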

I'm not sure how they deal with the fact that a pointer cast into an
integer and back into a pointer (or is it vice versa?) must maintain
its value.  My guess is that they maintain a hash table of pointers
that have been converted into integers.

As for the overhead, it's just part of the garbage collection that
Lisp programmers have been living with for decades.  It's worth it not
to have to keep track of when memory needs to be deallocated.  And
Lisp Machines have special hardware that optimizes GC.

---
Barry Margolin
Thinking Machines Corp.

barmar@think.com
seismo!think!barmar

bc@halley.UUCP (Bill Crews) (09/02/87)

In article <6357@brl-smoke.ARPA> gwyn@brl-smoke.ARPA (Doug Gwyn ) writes:
>In article <2130@umn-cs.UUCP>, randy@umn-cs.UUCP (Randy Orrison) writes:
>> In article <483@mtxinu.UUCP> ed@mtxinu.UUCP (Ed Gould) writes:
>> >Pointers may be subtracted *only* if they point to members of the same
>> >array of elements.
>> How is this determined?  example:
>> strlen(s)
>> 	return (c-s);
>
>Obviously all characters in a string are in the same object (be it
>(char []) or chunk of malloc()-allocated storage.

It seems to me that everyone is ignoring his Ed's point.  Let's say a function
is to take two pointer arguments, a pointer to a string and a pointer
into the string.  What you say seems to indicate that arithmetic expressions
involving both pointers, such as their difference, will produce unpredictable
results at execution time, because the called function has no way of knowing
whether the pointers are actually to the same "string" or not.

-bc
-- 
Bill Crews                                   Tandem Computers
                                             Austin, Texas
..!seismo!ut-sally!im4u!esc-bb!halley!bc     (512) 244-8350

daveb@geac.UUCP (Brown) (09/02/87)

In article <26910@sun.uucp> guy@gorodish.UUCP writes:
>> ...  there should be a way to take any pointer and generate a byte
>> offset from byte 0 in whatever address space the code is running in.  The
>> reason is that you need such a beast to feed to lseek if you want to access
>> something through one of the /dev/mem devices (or maybe /proc).
>
>The standard should NOT address this.  The standard mentions neither "lseek"
>nor "/dev/mem" nor "/proc".  This sort of thing is rather non-portable, and is
>as such completely outside the scope of the standard. 

  I agree that the standard should not address machine-specific issues (and
especially /dev/mem), but the implementors of particular compilers for the
language need to address the question.  (this is more of an arch. than a c
discussion, however).
  The Adavolutians have chosen to relegate the discussion of what optional
features a particular compiler has implemented to a specific appendix: the
standard writers might well define such an appendix for the C language.  It
can then address such issues where the poor client might be able to find it.

  --dave (I once did QA on a compiler: never again) c-b
-- 
 David Collier-Brown.                 {mnetor|yetti|utgpu}!geac!daveb
 Geac Computers International Inc.,   |  Computer Science loses its
 350 Steelcase Road,Markham, Ontario, |  memory (if not its mind)
 CANADA, L3R 1B3 (416) 475-0525 x3279 |  every 6 months.

peter@sugar.UUCP (Peter da Silva) (09/02/87)

> The standard should NOT address this.  The standard mentions neither "lseek"

Are you saying that the ANSI 'C' library includes all the UNIX date/time
functions, but doesn't include lseek?

Ack, oop.
-- 
-- Peter da Silva `-_-' ...!seismo!soma!uhnix1!sugar!peter
--                  U   <--- not a copyrighted cartoon :->

dhesi@bsu-cs.UUCP (Rahul Dhesi) (09/03/87)

In article <625@sugar.UUCP> peter@sugar.UUCP (Peter da Silva) writes:
>Are you saying that the ANSI 'C' library includes all the UNIX date/time
>functions, but doesn't include lseek?

One distinguishing difference between operating systems designed with
interactive use in mind (e.g. AmigaDOS, MS-DOS, UNIX) and operating
systems that trace their ancestry to the days of punched cards (e.g.
VAX/VMS, most IBM mainframe operating systems, and perhaps Primos) is
the inability of the latter to do an arbitrary lseek.

I speculate that the punched-card paradigm was most effectively
implemented on disk by storing the card image as [<length> <data>] thus
allowing cards of any length (not just 80 characters) to be stored, and
easily skipped in a sequential read without having to read each
character.
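
A rough sketch of the [<length> <data>] layout speculated about above, using
only stdio.  The one-byte length field and the function names are assumptions
for illustration, not the format of any real operating system.

```c
#include <stdio.h>

/* Write one record: a one-byte length, then the data (len <= 255). */
int put_record(FILE *fp, const char *data, size_t len)
{
    if (len > 255 || fputc((int)len, fp) == EOF)
        return -1;
    return fwrite(data, 1, len, fp) == len ? 0 : -1;
}

/* Skip the next record without reading its data; 0 on success. */
int skip_record(FILE *fp)
{
    int len = fgetc(fp);

    if (len == EOF)
        return -1;
    return fseek(fp, (long)len, SEEK_CUR);
}

/* Read the next record into buf (capacity cap); returns its length
   or -1 on end of file, short read, or a record too big for buf. */
long get_record(FILE *fp, char *buf, size_t cap)
{
    int len = fgetc(fp);

    if (len == EOF || (size_t)len > cap)
        return -1;
    return fread(buf, 1, (size_t)len, fp) == (size_t)len ? (long)len : -1;
}
```

Skipping whole records is cheap, but finding "byte N of the data" means
walking every length field before it, which is one reason a byte-oriented
lseek is awkward on such a layout.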

Counterexamples probably exist.
-- 
Rahul Dhesi         UUCP:  {ihnp4,seismo}!{iuvax,pur-ee}!bsu-cs!dhesi

guy%gorodish@Sun.COM (Guy Harris) (09/03/87)

> Are you saying that the ANSI 'C' library includes all the UNIX date/time
> functions, but doesn't include lseek?

That is precisely what I am saying, because it is true.  I find the presence of
the date/time functions in the C standard somewhat questionable, as that sort
of date/time conversion is usually an OS function - both the internal format
used to represent dates and/or times, and the printable format generally used,
are OS-dependent.

"lseek" isn't in the standard, but then neither are "open", "close", "read",
nor "write".  This is as it should be; a portable program can't expect any more
from those routines than from their standard I/O equivalents.  Consider a
system that supports records in files that begin with a byte count, and uses
FORTRAN carriage control at the beginning of the record in text files - in such
a system, "read" and "write" would have to perform the same sort of translation
on data in order to make UNIX programs work without change, "lseek" would have
to work with cookies rather than byte offsets, and if you wanted to be able to
use "read" or "write" to get at the "raw" binary data in the file, you'd have
to have a text/binary flag on "open", or something such as that.

If you want a standard that ensures UNIX-flavored behavior, use POSIX, not ANSI
C.  Both types of standard have their roles, but they are different roles.
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com

gwyn@brl-smoke.ARPA (Doug Gwyn ) (09/04/87)

In article <625@sugar.UUCP> peter@sugar.UUCP (Peter da Silva) writes:
>Are you saying that the ANSI 'C' library includes all the UNIX date/time
>functions, but doesn't include lseek?

It doesn't include open(), read(), write(), fork(), etc. either.
The reason is that it is probably impossible to specify these
adequately in a common specification for all systems.  Since the
stdio routines ARE specified, there is little need for the
lower-level I/O routines in portable application programming.

The date/time functions are specified in a system-independent way
and are useful in portable applications.  The fact that they
originated in the UNIX C library is largely irrelevant; most of
the library routines in the proposed ANSI C standard did.

lseek(), read(), etc. are specified in IEEE 1003.1 (POSIX),
however, since it specifically addresses just UNIX-like systems.

gwyn@brl-smoke.ARPA (Doug Gwyn ) (09/04/87)

In article <27183@sun.uucp> guy%gorodish@Sun.COM (Guy Harris) writes:
-I find the presence of
-the date/time functions in the C standard somewhat questionable, as that sort
-of date/time conversion is usually an OS function - both the internal format
-used to represent dates and/or times, and the printable format generally used,
-are OS-dependent.

But the proposed ANSI C standard guarantees enough about these functions
to make them useful for portable programs.  That seems like a win.

jpn@teddy.UUCP (John P. Nelson) (09/04/87)

In article <625@sugar.UUCP> peter@sugar.UUCP (Peter da Silva) writes:
>> The standard should NOT address this.  The standard mentions neither "lseek"
>
>Are you saying that the ANSI 'C' library includes all the UNIX date/time
>functions, but doesn't include lseek?

The Draft Standard does not include any of the UNIX low-level io functions
(read/write/open/close) including lseek.  Fseek IS supported.

The rationale says something to the effect that the low level functions
are 1. redundant, 2. not necessarily any more efficient than the FILE
based functions.  They do mention the POSIX standard, and that those
functions will be defined there.

meissner@xyzzy.UUCP (Michael Meissner) (09/04/87)

> > The standard should NOT address this.  The standard mentions neither "lseek"
> 
> Are you saying that the ANSI 'C' library includes all the UNIX date/time
> functions, but doesn't include lseek?

Yes.  The functions open/read/write/lseek/close/ioctl/dup/dup2, etc. are all
in the province of POSIX.  Ansi C only deals with the standard I/O functions
for I/
-- 
Michael Meissner, Data General.		Uucp: ...!mcnc!rti!xyzzy!meissner
					Arpa/Csnet:  meissner@dg-rtp.DG.COM

guy%gorodish@Sun.COM (Guy Harris) (09/04/87)

> Let's say a function is to take two pointer arguments, a pointer to a
> string and a pointer into the string.  What you say seems to indicate
> that arithmetic expressions involving both pointers, such as their
> difference, will produce unpredictable results at execution time, because
> the called function has no way of knowing whether the pointers are actually
> to the same "string" or not.

It indicates no such thing.  The called function doesn't *have to* know whether
the pointers point to elements of the same array; it is free to subtract them
*as if* they were, since if they are a correct result will be produced if the
program is compiled by a conforming compiler.  As such, if the called function
is called correctly, so that the two pointers *do* point to members of the same
array, there will be no problem.
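
A small sketch of the case being discussed; the function names are made up
for the example.  Neither function checks that its pointers belong to the
same array, and neither needs to: that is the caller's obligation.

```c
#include <stddef.h>

/* Index of p within the string s.  Defined only when p points into s
   (or one past its last element); the caller must guarantee that. */
ptrdiff_t index_in(const char *s, const char *p)
{
    return p - s;
}

/* strlen written with pointer subtraction.  c starts at s and only
   advances up to the terminating null, so c and s always point into
   the same array and c - s is well defined. */
size_t my_strlen(const char *s)
{
    const char *c = s;

    while (*c != '\0')
        c++;
    return (size_t)(c - s);
}
```

Called as index_in(s, s + 3), the result is 3 on any conforming
implementation; called with pointers into two unrelated arrays, the result
is whatever the implementation happens to produce.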

If they do not point to members of the same array, the generated code that
subtracts them can produce the "expected" result, produce a garbage result, or
trigger global thermonuclear war; it is not obliged to worry about this.  The
Standard "imposes no requirements" on the behavior of an implementation in a
particular situation if the Standard indicates that behavior in that situation
is "undefined".  "Permissible behavior ranges from ignoring the situation
completely with unpredictable results, to behaving during translation or
program execution in a documented manner characteristic of the environment
(with or without the issuance of a diagnostic message), to terminating a
translation or execution (with the issuance of a diagnostic message)."

People seem to be having trouble with this point, so I'll give some concrete
examples.

In a system with a flat address space, and where pointer subtraction is done by
treating the bit patterns in the pointers as integral quantities, subtracting
them, and dividing the result by the size of the object type to which both
pointers point, subtraction of two "char *"s that do not point to members of
the same array will produce the "expected" result, namely the distance between
the two addresses in "char"-sized units.

In a system with a segmented address space, where pointer subtraction is done
by subtracting the offsets of the pointers, an entire array must fit into a
segment in order to produce a conforming implementation.  (You may even have to
ensure that the last element doesn't end on the last address of the segment, in
order that you can also subtract a pointer from "pointer_to_last_element"+1.)
If both pointers point to objects in the same array, the segment number is
irrelevant; subtracting the offsets gives the correct result.  If both pointers
point to objects in different segments (which are obviously not members of the
same array), you will get a meaningless result.  This is not a problem for e.g.
"strlen"; "strlen" *will* give the correct length if handed a real string (such
that all characters, including the null character, are members of the same
array, and thus in the same segment).  What it does when handed something that
isn't a real string is irrelevant.
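
The segmented case can be modelled in portable C.  This is only a sketch
under stated assumptions: "struct seg_ptr" and "seg_diff" are hypothetical
names, and real compilers emit this arithmetic directly rather than calling
a function.

```c
#include <stdint.h>

/* Hypothetical segmented pointer; field names are illustrative only. */
struct seg_ptr {
    uint16_t segment;
    uint16_t offset;
};

/* Subtraction the way such an implementation might do it: the segment
   numbers are ignored and only the offsets are subtracted, then scaled
   by the element size. */
long seg_diff(struct seg_ptr a, struct seg_ptr b, unsigned elem_size)
{
    return ((long)a.offset - (long)b.offset) / (long)elem_size;
}
```

For two pointers in the same segment, say 2000:0015 and 2000:0010, this
gives 5 for "char" elements, as it should.  For 3000:0000 and 2000:0010 it
gives -16, which has nothing to do with where the two objects actually are.
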
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com

jpn@teddy.UUCP (John P. Nelson) (09/05/87)

[Lots of stuff about subtracting pointers deleted...]
>
>It seems to me that everyone is ignoring his Ed's point.  Let's say a function
>is to take two pointer arguments, a pointer to a string and a pointer
>into the string.  What you say seems to indicate that arithmetic expressions
>involving both pointers, such as their difference, will produce unpredictable
>results at execution time, because the called function has no way of knowing
>whether the pointers are actually to the same "string" or not.

This is the wrong way of looking at it.  The compiler is free to ASSUME
that the two pointers point to members of a single array: otherwise the
program would not be a "strictly conforming program".  The compiler
does not HAVE TO decide if the two pointers can be subtracted:  The
standard says that it is the programmer's problem to assure this.  The
compiler is free to do just about anything if the program is
incorrect.  In other words, in a segmented architecture (like the
8086), the compiler can ASSUME that the segments are identical, and
perform the computation on the two pointer offsets, because if the
segments are different, the program is not correct.

In a tagged architecture (or an interpreted environment), the fact that
the two pointers do not point to members of the same array might be detected
at RUN TIME.  The standard says that this is a perfectly valid
approach.  If you have a linear address space, where all addresses fit
into an integer of some kind, the standard does not forbid returning
the distance between two arbitrary pointers.  You simply cannot assume
that this will work for all implementations.

mc68020@gilsys.UUCP (Thomas J Keller) (09/05/87)

In article <6397@brl-smoke.ARPA>, gwyn@brl-smoke.ARPA (Doug Gwyn ) writes:
> In article <625@sugar.UUCP> peter@sugar.UUCP (Peter da Silva) writes:
> >Are you saying that the ANSI 'C' library includes all the UNIX date/time
> >functions, but doesn't include lseek?
> It doesn't include open(), read(), write(), fork(), etc. either.
> The reason is that it is probably impossible to specify these
> adequately in a common specification for all systems.  Since the
> stdio routines ARE specified, there is little need for the
> lower-level I/O routines in portable application programming.

   So in other words, Mr. Gwyn, what you are saying is that the ANSI C
workgroup has taken it upon themselves to decide that "portable applications"
programs have NO NEED to do other than straight sequential I/O on files,
is this correct?  How very paternalistic of them! 

   Sounds to me as if some (most?) of the people on that group are making
some pretty heavy assumptions, some of which may well BREAK the usefulness
of the ANSI C standard sufficiently as to render it totally USELESS.

-- 
Tom Keller 
VOICE  : + 1 707 575 9493
UUCP   : {ihnp4,ames,sun,amdahl,lll-crg,pyramid}!ptsfa!gilsys!mc68020

allbery@ncoast.UUCP (09/05/87)

As quoted from <625@sugar.UUCP> by peter@sugar.UUCP (Peter da Silva):
+---------------
| > The standard should NOT address this.  The standard mentions neither "lseek"
| 
| Are you saying that the ANSI 'C' library includes all the UNIX date/time
| functions, but doesn't include lseek?
+---------------

We're not talking about UNIX standards, we're talking about C standards.
Date and time are easily convertible under any OS; but how do you implement
a byte-oriented lseek() under VMS?  VM/CMS on an IBM?  (Both use fixed 80-byte
records for text files -- NOT byte streams! -- and other record formats for
non-text files.)
-- 
	    Brandon S. Allbery, moderator of comp.sources.misc
  {{harvard,mit-eddie}!necntc,well!hoptoad,sun!mandrill!hal}!ncoast!allbery
ARPA: necntc!ncoast!allbery@harvard.harvard.edu  Fido: 157/502  MCI: BALLBERY
   <<ncoast Public Access UNIX: +1 216 781 6201 24hrs. 300/1200/2400 baud>>
All opinions in this message are random characters produced when my cat jumped
(-:		      up onto the keyboard of my PC.			   :-)

peter@sugar.UUCP (Peter da Silva) (09/06/87)

> a system, "read" and "write" would have to perform the same sort of translation
> on data in order to make UNIX programs work without change, "lseek" would have
> to work with cookies rather than byte offsets, and if you wanted to be able to
> use "read" or "write" to get at the "raw" binary data in the file, you'd have
> to have a text/binary flag on "open", or something such as that.

I have no problem with any of that. On systems where it is not appropriate
to use read() and write() as the primitives, implement them using fread()
and fwrite(). There is certainly a precedent for having flags on open(),
too.
-- 
-- Peter da Silva `-_-' ...!seismo!soma!uhnix1!sugar!peter
--                 'U`  <-- Public domain wolf.

rsalz@bbn.com (Richard Salz) (09/09/87)

In article <6397@brl-smoke.ARPA>, gwyn@brl-smoke.ARPA (Doug Gwyn ) explains
that ANSI doesn't specify open, read, write.

To which, in comp.unix.wizards (<1122@gilsys.UUCP>), mc68020@gilsys.UUCP
(Thomas J Keller) writes:
>   So in other words, Mr. Gwyn, what you are saying is that the ANSI C
>workgroup has taken it upon themselves to decide that "portable applications"
>programs have NO NEED to do other than straight sequential I/O on files,
>is this correct?  How very paternalistic of them! 

In general, Doug's postings have to be read the same way you read K&R or
the vintage Unix manuals (i.e., when the programmers wrote them, not a
separate techdoc department):  pay attention to every word, and give as
much note to what is not said, as to what is said.  It is not appropriate
for ANSI to specify the "Unix system-call" level, it is appropriate for
them to document the "standard I/O level."  Hence, X3J11 does specify
fseek.

Please read, and ponder, more carefully before you make snide, insulting
comments -- especially to or about people as useful to the net as Doug.
	/r$
-- 
For comp.sources.unix stuff, mail to sources@uunet.uu.net.

guy%gorodish@Sun.COM (Guy Harris) (09/09/87)

> I have no problem with any of that. On systems where it is not appropriate
> to use read() and write() as the primitives, implement them using fread()
> and fwrite().

What would this buy you, other than a false sense of security when moving UNIX
programs to non-UNIX systems?
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com

allbery@ncoast.UUCP (Brandon Allbery) (09/09/87)

As quoted from <286@halley.UUCP> by bc@halley.UUCP (Bill Crews):
+---------------
| It seems to me that everyone is ignoring his Ed's point.  Let's say a function
| is to take two pointer arguments, a pointer to a string and a pointer
| into the string.  What you say seems to indicate that arithmetic expressions
| involving both pointers, such as their difference, will produce unpredictable
| results at execution time, because the called function has no way of knowing
| whether the pointers are actually to the same "string" or not.
+---------------

The point is that, while the subtraction might be doable, on a given
architecture subtracting two pointers not into the same array might not
have a meaning.  On such hardware, the MMU knows what's what and the
addresses reflect this.

This is basically a declaration that the software isn't required to spin
its wheels trying to deal with "unusual" hardware (can you truly call a
PC "unusual"?  But large-model pointers are susceptible, since multiple
<segment>:<offset> pairs may point to the same address.  Pointer normalization
is expensive).  Note that in all of these cases, falling off the end of a
string will still yield a valid pointer (for subtraction, at least) --
although some tagged architectures might hand you a segmentation violation
when you try to go beyond the defined end of the string/data area.
-- 
	    Brandon S. Allbery, moderator of comp.sources.misc
  {{harvard,mit-eddie}!necntc,well!hoptoad,sun!mandrill!hal}!ncoast!allbery
ARPA: necntc!ncoast!allbery@harvard.harvard.edu  Fido: 157/502  MCI: BALLBERY
   <<ncoast Public Access UNIX: +1 216 781 6201 24hrs. 300/1200/2400 baud>>
All opinions in this message are random characters produced when my cat jumped
(-:		      up onto the keyboard of my PC.			   :-)

guy%gorodish@Sun.COM (Guy Harris) (09/09/87)

>    So in other words, Mr. Gwyn, what you are saying is that the ANSI C
> workgroup has taken it upon themselves to decide that "portable applications"
> programs have NO NEED to do other than straight sequential I/O on files,
> is this correct?

"fseek" is in the standard.  What is this nonsense about "straight sequential
I/O?"
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com

henry@utzoo.UUCP (Henry Spencer) (09/09/87)

>    So in other words, Mr. Gwyn, what you are saying is that the ANSI C
> workgroup has taken it upon themselves to decide that "portable applications"
> programs have NO NEED to do other than straight sequential I/O on files,
> is this correct? ...

Nonsense.  Stdio includes fread, fwrite, and fseek.  The X3J11 drafts do
put some restrictions on portable uses of them, which are inevitable given
that the full generality of something like Unix seeks is unimplementable
on some systems.  The question is not whether portable applications have
real needs to do strange things, but whether these strange things can be
done in a *portable* way that will work on *most* machines.  Often they can,
*if* one is willing to work at the level of stdio and observe some extra
restrictions.  It is neither appropriate nor practical for X3J11 to wave
a magic wand and declare that any system which can't implement full Unix
semantics is broken.  There really are things which simply CANNOT BE DONE
in a portable way, and people writing portable programs or designing tools
for writing portable programs must acknowledge this.
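
As an illustration of working "at the level of stdio" under extra
restrictions, here is a minimal sketch.  It assumes the ANSI C rule for text
streams: the only offsets that may be passed to fseek are zero, or a value
previously returned by ftell on the same file and used with SEEK_SET.  The
helper name reread_second_line is hypothetical.  Note that tmpfile() opens a
binary stream, where other offsets are also allowed; the discipline shown is
the one that carries over to text streams.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write two lines, remember where the second one starts, then seek back
 * to that remembered position and read the line into buf.
 * Returns 0 on success, -1 on any failure. */
int reread_second_line(char *buf, size_t buflen)
{
    FILE *fp = tmpfile();
    long mark;
    int status = -1;

    assert(buf != NULL && buflen > 0);
    if (fp == NULL)
        return -1;

    if (fputs("first line\n", fp) == EOF)
        goto done;
    mark = ftell(fp);               /* opaque position, not a byte count */
    if (mark == -1L)
        goto done;
    if (fputs("second line\n", fp) == EOF)
        goto done;

    /* The portable move: SEEK_SET with a value that ftell handed back.
     * The seek also separates the preceding writes from the read. */
    if (fseek(fp, mark, SEEK_SET) != 0)
        goto done;
    if (fgets(buf, (int)buflen, fp) == NULL)
        goto done;
    status = 0;

done:
    fclose(fp);
    return status;
}
```

Nothing here does arithmetic on the value returned by ftell, which is what
lets the same code work on systems where a file position is not a simple
byte offset.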
-- 
"There's a lot more to do in space   |  Henry Spencer @ U of Toronto Zoology
than sending people to Mars." --Bova | {allegra,ihnp4,decvax,utai}!utzoo!henry

jc@minya.UUCP (09/12/87)

Hey, what for are all youse guys talkin' about fseek, lseek, and even read 
and write, when da Subject line says:
	Re: pointer alignment when int != char *

Ennyhow, dis here group is supposta be about C.  C doesn't have I/O.  
It just has function calls; it don't know from nowhere about I/O.

[Oops, pardon the Chicagoese; please move the discussion to some place
where it is relevant, like maybe comp.os.unix or something similar.  As
for any claim that some C standards discuss irrelevancies like I/O, well,
there *is* a newsgroup for discussing C standards.]

-- 
	John Chambers <{adelie,ima,maynard}!minya!{jc,root}> (617/484-6393)

mouse@mcgill-vision.UUCP (09/22/87)

In article <8024@think.UUCP>, barmar@think.COM (Barry Margolin) writes:
> In article <588@murphy.UUCP> dave@murphy.UUCP (Dave Cornutt) writes:
[>>> is someone else; Dave is >>]
>>> [Lisp Machines have a garbage collector which moves objects,
>>> resulting in pointers that change behind your back]
>> [but this means a pointer can be invalidated on you]
> [when GC moves something, it updates all pointers.]
> A particular pointer variable will always point to the same object
> (until it is reassigned, of course), although its internal numerical
> value may change.

> I'm not sure how they deal with the fact that a pointer cast into an
> integer and back into a pointer (or is it vice versa?) must maintain
> its value.  My guess is that they maintain a hash table of pointers
> that have been converted into integers.

Lisp doesn't have that sort of cast (well, it usually does, but only as
a documented-to-be-dangerous subprimitive).  Does Symbolics or LMI or
anyone provide a C compiler for a Lisp Machine?  If they do, I would
guess that either they do as you suggest, maintaining some table of
objects which must not be moved, or they have hooks into the garbage
collector permitting the C run-time to lock objects down, or they
simply make the whole C environment one Lisp object, which must be
relocated as a whole if it is relocated at all.  (Then of course all
pointers will be relative to the beginning of this area.)
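
The last option can be sketched in C.  This is only an illustration under
assumed names (struct arena, aref, arena_relocate are all hypothetical), not
a description of how any Lisp Machine C compiler actually works.  "Pointers"
are stored as offsets from the start of one block, so the block can be moved
as a whole without touching them.  Alignment is ignored to keep it short.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical relocatable area holding the whole "C environment". */
struct arena {
    unsigned char *base;   /* changes when the area is relocated */
    size_t size;
    size_t used;
};

typedef size_t aref;       /* a "pointer": byte offset from arena base */

int arena_init(struct arena *a, size_t size)
{
    a->base = malloc(size);
    a->size = size;
    a->used = 0;
    return a->base != NULL ? 0 : -1;
}

/* Bump allocation; returns (aref)-1 when the area is full. */
aref arena_alloc(struct arena *a, size_t n)
{
    aref r;

    if (n > a->size - a->used)
        return (aref)-1;
    r = a->used;
    a->used += n;
    return r;
}

/* Convert an offset to a machine address, valid only until the next move. */
void *arena_deref(const struct arena *a, aref r)
{
    assert(r < a->used);
    return a->base + r;
}

/* Move the whole area, as a collector might; offsets stay valid. */
int arena_relocate(struct arena *a)
{
    unsigned char *fresh = malloc(a->size);

    if (fresh == NULL)
        return -1;
    memcpy(fresh, a->base, a->used);
    memset(a->base, 0xDD, a->size);   /* scribble so stale addresses show up */
    free(a->base);
    a->base = fresh;
    return 0;
}
```

An aref obtained before arena_relocate() still names the same object
afterwards, while a raw address obtained from arena_deref() does not.  The
cost is that every dereference goes through the base.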

					der Mouse

				(mouse@mcgill-vision.uucp)