[comp.lang.c] type-indexed arrays

karl@haddock.UUCP (Karl Heuer) (06/09/87)

In article <20540@sun.uucp> guy%gorodish@Sun.COM (Guy Harris) writes:
>[haddock!karl (Karl Heuer) writes:]
>>Given the suggested "pointer-like" semantics for arithmetic, and the
>>ability to declare an array indexed by enums (or other types!), I could
>>find more use for them.
>>int hue[enum color] = { RED: 0x00F, GREEN: 0x0F0, BLUE: 0xF00 }; /* I wish */
>
>I agree that it would be nice to support declaring arrays indexed by
>"enum"s.  (What "other types" would be indices?  "char", presumably;
>the only other types I could see would be subrange types - are you
>proposing adding them to C?)

Maybe someday -- but no, that wasn't what I was thinking.  Some machines can
support a[short] with no trouble, but that would raise the question of which
types a conforming implementation is *required* to support (and hence can be
used in a portable program)...

>One interesting possibility is suggested by the fact that the set of
>numerical encodings of "enum" values need not be "dense", and thus that
>	enum cookie { chocolate_chip = 1, oreo = 10, oatmeal = 100 };
>	int price[enum cookie];
>might cause "price" to be a sparse array, with lookups in that array
>done by some technique other than indexing.  This is analogous to
>"switch" statements; the compiler selects different implementations
>based on the set of values used, with indexing used for a
>sufficiently "dense" set, various combinations of hashing and
>table-search used for less-"dense" sets, and a cascade of comparisons
>used for other sets.

I glad you posted this -- I was going to mention the same idea myself, but I
prefer the feedback so I can tell someone is listening to my ravings.  :-)
In your example, the implementation might choose to allocate three ints and
use a cascade of comparisons (actually, one comparison against "oreo" would
suffice).

Now, we could assert that a conforming implementation must support *all* types
as subscripts, and that "int a[int]" will, if necessary, cause some sort of
"smart" lookup.  The problem now is that the amount of space required is not
known at compile time.  If the compiler is going to use a hash table, it can
probably make good use of a "hint" from the programmer.  (Which probably means
another new keyword.  Ugh.)

Incidentally, I am not suggesting that this feature be incorporated into the
new standard -- I realize there's very little hope for that.  However, I *do*
believe it is a good idea (at least for enum), and would like to see it
implemented as an extension.  If it catches on, it may be a candidate for the
__STDC__==2 release.

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

rwa@auvax.UUCP (Ross Alexander) (06/12/87)

In article <526@haddock.UUCP>, karl@haddock.UUCP (Karl Heuer) writes:
> Some machines can support a[short] with no trouble, but that would raise
> the question of which types a conforming implementation is *required*
> to support (and hence can be used in a portable program) [...]

and a little further along,

> Now, we could assert that [the compiler] must support *all* types
> as subscripts, and that "int a[int]" will, if necessary [...]

and yet further,

> Incidentally, I am not suggesting that this feature be incorporated into the
> new standard -- I realize there's very little hope for that.  However, I *do*
> believe it is a good idea (at least for enum) [...]

I can think of only one little quibble with the above:  C arrays are
considered to begin with the zero'th element and end at the N-1'th
element, right?  So to declare something as 

	int	array[ short ];

implies that array[] has members indexed by -1, -2, -3, and so on
('short' is a signed type, no?).  So Karl will also have to think of
some way to introduce the declaration of array index bases, as well
as array index ranges, a la Algol-60 or Pascal; and this then
introduces a rather large can of worms when pointers to based arrays
are passed around.  Say I declare an array

	int	demo[ 21 : -10 ];

which I define to mean 'an array of ints, called demo, 21 ints long,
first element is demo[ -10 ]'.  Fine.  Now, what is the value of the
token 'demo' ?  Is it '& demo[ -10 ]' ?  Is it '& demo[ 0 ]' (this is
K&R's idea of 'demo', I think) ?  And further, when I then code up
something like 'frobozz( 2, "hi there", demo );' and frobozz() looks
like

	double frobozz( a, b, c );
		int a;
		char * b;
		int c[];
	{ /* and so on... */

what is the allowable range of indicies of c[] ? -10 .. +10 ?  0 ..
20 ?  Beats _me_ :-).

As a quick out (because I _do_ like the basic idea), how about
saying that subscript types must be unsigned?

...!ihnp4!alberta!auvax!rwa	Ross Alexander @ Athabasca University

ford@crash.UUCP (06/14/87)

In article <184@auvax.UUCP> rwa@auvax.UUCP (Ross Alexander) writes:
>I can think of only one little quibble with the above:  C arrays are
>considered to begin with the zero'th element and end at the N-1'th
>element, right?  So to declare something as 
>
>	int	array[ short ];
>
>implies that array[] has members indexed by -1, -2, -3, and so on

No.  A supscript must simply be some sort of integral type which the copmiler
can perform multiplication on or stick into an index register (i.e. it must
be something you can add to a pointer).  In particuar, pointers may be
dereferenced with a negative subscript.  According to K&R's C Reference
Manual, section 14.3:

	"By definition, the subscript operator [] is interpreted in such a
	way that  E1 [ E2 ]  is identical to  * ( (E1) + (E2) )."

This means that, since negative numbers can be added to pointers, it is
possible to reference negatively indexed members of an array.  This would
only be useful if you had a pointer to the middle of an array, just as it
is only useful to use subscripts less than the size of an array.  As an
example, consider the code to remove a newline from the end of a string:

	void trim(string)
	char *string;
	{
	    while (*string)
		++string;
	    if ( string[-1] == '\n' )
		string[-1] = '\0';
	}

This code is optimized for situations where the newline will not usually be
present, otherwise the first string[-1] would be *(--string) and the second
would be *string.  This is because using a non-zero subscript is usually
faster than actually decrementing string, but decrementing it is faster than
using the non-zero subscript twice.  Note that for simplicity, the example
does not properly handle the case where a zero-length string is passed.  It
would reference the byte immediately before the string in memory, which is
undefined in general.

> [...]  Say I declare an array
>
>	int	demo[ 21 : -10 ];
>
>which I define to mean 'an array of ints, called demo, 21 ints long,
>first element is demo[ -10 ]'.

This is not a bad idea, but I would prefer something like the pascal syntax:

	int	demo[ -10 .. 10 ];

>Now, what is the value of the token 'demo' ?

I think it should be the address of the FIRST element in demo (i.e.
demo[-10]).  This allows such idiomatic constructs as

	write(fd, demo, sizeof demo);

which would become very non-intuitive if it were necessary to specify the
starting or ending indices.

>                                    And further, when I then code up
>something like 'frobozz( 2, "hi there", demo );' and frobozz() looks
>like
>
>	double frobozz( a, b, c );
>		int a;
>		char * b;
>		int c[];
>	{ /* and so on... */
>
>what is the allowable range of indicies of c[] ? -10 .. +10 ?  0 ..
>20 ?
>

The problem here is that the declaration "int c[]" is misleading, it implies
that an array is passed to this function, when only a pointer is passed.  It
makes sence to declare c as "int *", and then you can pass the address of
whichever element of demo[] you like, with "demo" alone refering to the add-
ress of the FIRST (not necessarily the zero-th) element.

I see no real reason to have negative (or even specifiable) ranges of indices
for arrays.  This is another "educational" feature of pascal that detracts
from C's philosophy of being a 'high-level assembly language', i.e., C is
meant to provide a portable representation for operations present in most
machines' instruction sets.  But negative subscripts to POINTERS can be very
useful, almost always has a direct mapping to machine instructions, and is
already used in a lot of code that I have seen, including most C compilers'
C libraries (in things like strcat, strtok, et al.).

Michael "Ford" Ditto				-=] Ford [=-
P.O. Box 1721					ford@crash.CTS.COM
Bonita, CA 92002				ford%oz@prep.mit.ai.edu
-- 

Michael "Ford" Ditto				-=] Ford [=-
P.O. Box 1721					ford@crash.CTS.COM
Bonita, CA 92002				ford%oz@prep.mit.ai.edu

edw@ius2.cs.cmu.edu.UUCP (06/14/87)

In article <1226@crash.CTS.COM>, ford@crash.CTS.COM (Michael Ditto) writes:
> 
> > [...]  Say I declare an array
> >
> >	int	demo[ 21 : -10 ];
> >
> >which I define to mean 'an array of ints, called demo, 21 ints long,
> >first element is demo[ -10 ]'.
> 
> This is not a bad idea, but I would prefer something like the pascal syntax:
> 
> 	int	demo[ -10 .. 10 ];
> 
> >Now, what is the value of the token 'demo' ?
> 
> I think it should be the address of the FIRST element in demo (i.e.
> demo[-10]).  This allows such idiomatic constructs as
> 
> 	write(fd, demo, sizeof demo);
> 
> which would become very non-intuitive if it were necessary to specify the
> starting or ending indices.

	We have a problem here.  If demo is actually the address of
the first element of the array then what about the code generated for
a = demo[-10] presuming that you what to index over the subrange.
Does the compiler have to keep around subrange values so the code generated
for the above is

		

	a = *(demo + index - range_base)
> 
> >                                    And further, when I then code up
> >something like 'frobozz( 2, "hi there", demo );' and frobozz() looks
> >like
> >
> >	double frobozz( a, b, c );
> >		int a;
> >		char * b;
> >		int c[];
> >	{ /* and so on... */
> >
> >what is the allowable range of indicies of c[] ? -10 .. +10 ?  0 ..
> >20 ?
> >
> 
> The problem here is that the declaration "int c[]" is misleading, it implies
> that an array is passed to this function, when only a pointer is passed.  It
> makes sence to declare c as "int *", and then you can pass the address of
> whichever element of demo[] you like, with "demo" alone refering to the add-
> ress of the FIRST (not necessarily the zero-th) element.

	This is how you propose to solve the problem of indexing
for subranges in C.  Why bother putting them into the language in 
the first place if you can't subscript over them in the most natural way
ie.

		for (i = -10; i <= 10; i++)
			sum += c[i];

  The better solution would be to require the subrange declaration
for c. ie	
		int c[-10..10];

> 
> I see no real reason to have negative (or even specifiable) ranges of indices
> for arrays.  This is another "educational" feature of pascal that detracts
> from C's philosophy of being a 'high-level assembly language', i.e., C is
> meant to provide a portable representation for operations present in most
> machines' instruction sets.  But negative subscripts to POINTERS can be very
> useful, almost always has a direct mapping to machine instructions, and is
> already used in a lot of code that I have seen, including most C compilers'
> C libraries (in things like strcat, strtok, et al.).
> 
> Michael "Ford" Ditto				-=] Ford [=-
> P.O. Box 1721					ford@crash.CTS.COM
> Bonita, CA 92002				ford%oz@prep.mit.ai.edu

   Points to be made.

	If subranges are added to the language then

	1) the indexing code for arrays would be let efficient

		array base address + index - base subrange
		(constant folding can remove some inefficencies
		 but not all - if the array is an automatic var
		 the address isn't known until the function is called)

	2) would require the code generation to be a lot smarter about
	   array indexing

		can nolonger assume the base index will start at the
		integer value 0 - indexing subrange information
		on arrays would have to be keep around.

	3) make the language more strongly typed

		To properly handle the frobozz example, the subrange
		information has to be present.   Hence, one could
		say that int ??[-10..10] is a new type and to 
		insure proper preformance only variables of the
		same type should be given as input parameters.

-- 
					Eddie Wyatt

e-mail: edw@ius2.cs.cmu.edu

rwa@auvax.UUCP (Ross Alexander) (06/15/87)

In article <1226@crash.CTS.COM>, ford@crash.CTS.COM (Michael Ditto) writes:
> In article <184@auvax.UUCP> rwa@auvax.UUCP (Ross Alexander) writes:
> >	int	array[ short ];
> >
> >implies that array[] has members indexed by -1, -2, -3, and so on
> 
> No.  A subscript must simply be some sort of integral type which the copmiler
> can perform multiplication on or stick into an index register (i.e. it must
> be something you can add to a pointer).

yes, yes, we all know that.  float and double subscript values are
strictly for BASIC, right :-) ??  what i was driving at was "how much
storage is the compiler going to ALLOCATE for array[ short ] ?", and
the corollory question is "what is the value of the index of the
first and last allocated elements of array[]?".

by analogy with Pascal, one might expect the first allocated element
to have a negative index value, since 'short' is a signed type, and
is really "MIN_SHORT_VALUE ..  MAX_SHORT_VALUE" when you think of it
as a subrange type (subrange of int).  on 'typical' architectures,
this might be -0x8000 ..  0x7fff or whatever.  this was my meaning
(poorly expressed, i grant you) when i said 'indexed by -1, -2, -3,
and so on'.  at any rate, the array will have (some) elements indexed
by negative values, and (some) by non-negative values.

now, getting back to the can of worms (smile, gentle people), say i
declare

	int	char_count[ char ];

(I want to count character frequencies in some input, or something),
and further let us say that chars are signed, so this declaration is
equivalent to

	int	char_count[ MIN_CHAR .. MAX_CHAR ];

with MIN_CHAR == -128, and MAX_CHAR == 127.  (As an aside, this is
precisely the case locally).

now, you suggest that the token 'char_count' would have the value '&
char_count[ MIN_CHAR ]'.  (actually, it's a pointer constant, but
never mind).  then i can say that '* ( char_count + 3 )' references
an int three int's further on in memory than '& char_count[ MIN_CHAR
]'.  ok?  and of course, 'char_count[ 3 ]' is the third int after
'char_count[ 0 ]', which to my way of thinking is in the middle of
the array.  but we get into a small problem when we observe that '* (
char_count + 3 )' and 'char_count[ 3 ]' are formally equivalent, but
talk about entirely different places (whoops!!).  this _reductio ad
absurdem_ leads me to think that fooling with the current semantics
of the pointer constant 'char_count' is not a good idea, i.e., it
should still mean '& char_count[ 0 ]' regardless of the fact that
happens to be the middle of the array.  And you yourself observed
how much grief that is going to cause.

has anybody got a bright idea?  i am (reluctantly) forced to think
that fiddling with the semantics of C array declarations is not wise.

as any aside,  any one have a good way to introduce sets into C?  we
already have enum's,  i really would like to have sets and set
operations.  i know that you can fake it for small sets ( <= number
of bits in a 'long' ) very easily, but it gets ugly after that.  and
writing

	enum days { sunday, monday, /* ... */, friday, saturday };
	long days_off;
	int x;

	days_off = ( 1 << sunday ) | ( 1 << saturday );
	if ( ( days_off & ( 1 << x ) ) != 0 ) { /* do something */ };


is a little less intuitive than ( i create syntax freely here, the
semantics are pretty much the same )

	set days_off[ days ];

	days_off = [ sunday, saturday ];
	if ( days_off[ x ] ) { /* do something */ };

i don't like the notation especially, but...

...!ihnp4!alberta!auvax!rwa		Ross Alexander, Athabasca University

grk@sfsup.UUCP (06/15/87)

[Original article not included for brevity--type `p' if you want to see it]

There have been several people who suggested that it would be nice if C had
non-zero origined arrays.  How about the following (albeit kludgy)
solution:

	int a[21];	/* really meant to be a[ -10 .. 10 ] */
	int *na;	/* na stands for new-a, good software engineering
			   type descriptive variable name :-) */

	na = &a[10];	/* na[-10] is now a[0], na[0] is a[10], etc. */

--Ralph
-- 
	G. Ralph Kuntz N2HBN	UUCP: {ihnp4,allegra}!attunix!grk
				ARPA: rutgers.rutgers.edu!pisc2b!grk
				PACKET: N2HBN @ NN2Z

franka@mmintl.UUCP (Frank Adams) (06/15/87)

In article <1226@crash.CTS.COM> ford@crash.CTS.COM (Michael "Ford" Ditto) writes:
|In article <184@auvax.UUCP> rwa@auvax.UUCP (Ross Alexander) writes:
|>I can think of only one little quibble with the above:  C arrays are
|>considered to begin with the zero'th element and end at the N-1'th
|>element, right?
|
|According to K&R's C Reference Manual, section 14.3:
|
|	"By definition, the subscript operator [] is interpreted in such a
|	way that  E1 [ E2 ]  is identical to  * ( (E1) + (E2) )."
|
|	int	demo[ -10 .. 10 ];
|...
|>Now, what is the value of the token 'demo' ?
|
|I think it should be the address of the FIRST element in demo (i.e.
|demo[-10]).

But this is incompatible with the quotation from K&R, above.  'demo' would
have to mean '&demo[0]' for this definition to work.
-- 

Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
Ashton-Tate          52 Oakland Ave North         E. Hartford, CT 06108

karl@haddock.UUCP (Karl Heuer) (06/16/87)

In article <184@auvax.UUCP> rwa@auvax.UUCP (Ross Alexander) writes:
>In article <526@haddock.UUCP>, karl@haddock.UUCP (Karl Heuer) writes:
[proposal for a new syntax allowing a type name instead of an array size]
>
>I can think of only one little quibble with the above:  C arrays are
>considered to begin with the zero'th element and end at the N-1'th
>element, right?  So to declare something as 
>	int	array[ short ];
>implies that array[] has members indexed by -1, -2, -3, and so on
>('short' is a signed type, no?).

Agreed.  The set of valid subscripts would be -32768 to +32767 (on a vax).

>So Karl will also have to think of some way to introduce the declaration of
>array index bases, as well as array index ranges, a la Algol-60 or Pascal;

What for?  The notation "int array[short]" does that implicitly, by (my)
definition.  The more general case may also be useful, but that can be an
independent issue.  (And the best solution may turn out to be pascal-like
subrange types.)

>and this then introduces a rather large can of worms when pointers to based
>arrays are passed around.

No problem.  You can already pass arrays around without knowing the base or
size: "int a[10]; p = a+3;" gives you a pointer whose valid subscripts are -3
to +6.

>Say I declare an array [named demo, size 21 ints, based at -10].  Now, what
>is the value of the token 'demo' ?  Is it '&demo[-10]'?  Is it '&demo[0]'?

If the definition of subscripting is to remain intact, demo must be equal to
&demo[0].  This does create some minor problems, but then so does the original
idea of an enum-indexed array in a strongly typed world.  It may be that this
can only be implemented properly by making the array a real datatype.

>As a quick out (because I _do_ like the basic idea), how about
>saying that subscript types must be unsigned?

You get all of the same problems with "enum foo { X=11, Y=12, Z=13 }".

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

karl@haddock.UUCP (Karl Heuer) (06/16/87)

In <516@haddock.UUCP> and <526@haddock.UUCP> I suggested the possibility of a
type-indexed array, the particular point of interest being enums:
    enum color { RED, GREEN, BLUE };
    int hue[enum color] = { RED: 0x00F, GREEN: 0x0F0, BLUE: 0xF00 };
This is supposed to create an array which must be subscripted by enum color;
using an int subscript would be illegal.  The idea was that this could make
enums more useful in the strong model.  (In the strong model, enums are
separate types as opposed to disguised integers.)

Somehow I overlooked the fact that this is at odds with the definition of the
subscript operator.  If "hue[RED]" is an int, and is the same as "*(hue+RED)",
then "hue" must have a type T such that T+enum yields "int *".  There is no
such type.  (Logically, it's the type you get if you subtract an enum from a
pointer, but this is an illegal operation in the strong model.)

I see a few ways to salvage this:

[0] Accept the weak model of enums.  In this case we aren't adding anything.
(If enums are just ints, they can already be used as subscripts; the only
advantage is the notational convenience in declaring the array.)  We lose the
typechecking (detection of non-enum subscripts).

[1] Legalize the operation "pointer - enum".  You can add an integer to this
type, yielding the same type.  You cannot dereference it.  You can add an enum
(of the correct type), yielding a normal (dereferencable) pointer.  This seems
quite kludgy, but may be workable.  (How do you declare an object of this new
type?)

[2] Discard the idea that a[i] means *(a+i), and the automatic conversion of
arrays to pointers.  This is probably best in the long run, but is a rather
substantial change in the language, and would have to be phased in.

>[I] would like to see it implemented as an extension.  If it catches on, it
>may be a candidate for the __STDC__==2 release.

Suddenly, even __STDC__==3 seems optomistic.  Unless I've missed something?

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

guy@gorodish.UUCP (06/16/87)

> [2] Discard the idea that a[i] means *(a+i), and the automatic conversion of
> arrays to pointers.  This is probably best in the long run, but is a rather
> substantial change in the language, and would have to be phased in.

Note that C++ has implicitly acknowledged that the "a[i] means
*(a+i)" model isn't always right; when you add a new data type you
can define a subscripting operator for it.  I agree that it would be
best in the long run, but one is reminded of Keynes' remark about the
long run....
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com

karl@haddock.UUCP (Karl Heuer) (06/19/87)

In article <21213@sun.uucp> guy%gorodish@Sun.COM (Guy Harris) writes:
>[karl@haddock.isc.com (Karl Heuer) writes:]
>>[2] Discard the idea that a[i] means *(a+i), and the automatic conversion of
>>arrays to pointers.  This is probably best in the long run, but is a rather
>>substantial change in the language, and would have to be phased in.
>
>Note that C++ has implicitly acknowledged that [this model] isn't always
>right; [you can redefine subscripting for new types].  I agree that it would
>be best in the long run, but one is reminded of Keynes' remark [...]

So maybe we should just shorten the long run.  If we want to phase in this
change, I would expect that the first step would be to redefine subscripting
and array/pointer conversion as follows:

3.2.2.1 [Conversion of] Arrays, functions and pointers
Add the following paragraph:
    When used as an operand where an array is required, an expression that has
    type "pointer to _type_" is converted to an expression that has type
    "array of _type_ with unknown number of elements" whose initial member is
    pointed to by the pointer expression.

3.3.2.1 Array subscripting
Change the "constraints" paragraph:
    The left expression shall have type "array of _type_", the right
    expression shall have integral type, and the result has type "_type_".
Change the first "semantics" paragraph:
    E1[E2] designates the E2-th member of E1 (counting from zero).  Because of
    the conversion rule that applies to the subscripting operator, if E1 is a
    pointer to the initial member of an array object, E1[E2] is identical to
    (*(E1+(E2))).

This is just the reverse of the current wording*; the notational equivalence
of brackets and star-plus becomes a consequence instead of a definition.  I
claim that this change (along with appropriate rewording elsewhere in the
document) would not affect the behavior of the abstract machine at all.**
(Am I right?)  Thus, this change would be transparent to existing programs and
could be made immediately.

The hard part would be deprecating the use of "a" when "&a[0]" is meant.

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint
--------
*Note that X3J11 has already done something very similar by changing the
definition of the function-call operator: instead of the left expression being
a function locator, and allowing the use of non-dereferenced function pointers
by implicitly dereferencing them, they asserted that the expression is
*always* a function pointer, and that the normal usage implicitly takes the
address of the function locator.  Personally I think they went the wrong
direction; I'd rather have functions and function pointers be distinct, and
require the "&" as well as the "*".

**Well, actually my wording does forbid the barbaric construct i[a], but this
could be either allowed-but-deprecated or a Common Extension.