[comp.std.c++] Array References: request for comments

dl@g.g.oswego.edu (Doug Lea) (10/20/90)

I'm considering submitting the following to X3J16 (ANSI C++). But
before I do, I'd like to get more feedback. All comments would be
welcome, in an attempt to make this as strong and useful a proposal
as possible. Please email replies directly (dl@g.oswego.edu). 

Thanks!

-----------------
% Format with latex

\documentstyle{article}
\begin{document}
\title{{X3J16 Proposal:} \\  {Array References}}
\author{Doug Lea}
%\date{}
\maketitle

\section{The Problem}

In C and C++, a {\tt T*} may ambiguously refer to either a pointer to
a single object, or a pointer to the first of an array of objects.
(Throughout this proposal {\tt T} is a stand-in for any first-class
C++ type.)

This ambiguity is regarded by some as among C's major strengths, in
that it allows the total integration of pointers and arrays, and by
others as its greatest weakness as an infrastructure for C++, in that
it compromises type safety and impedes optimization of the kinds of
high-level constructs that are far more characteristic of C++ programs
than of C programs.

\subsection{Example 1}

\samepage{\begin{verbatim}
class B    
{ 
  public: int bvar;  virtual int f() { return 0; } 
};
\end{verbatim}}
\samepage{\begin{verbatim}
class D : public B 
{ 
  public: int dvar;  int f() { return 1; } 
};
\end{verbatim}}
\samepage{\begin{verbatim}
void useB(B* p) { int i = p[1].f() + p[1].bvar; }
\end{verbatim}}
\samepage{\begin{verbatim}
main()
{
  B single_b;        useB(&single_b);  // (1)
  B array_of_b[10];  useB(array_of_b); // (2)
  D single_d;        useB(&single_d);  // (3) 
  D array_of_d[10];  useB(array_of_d); // (4)
  B* bp = 0;         useB(bp);         // (5)
}
\end{verbatim}}

The safety problems are that a client (caller) of {\tt useB} cannot
tell by {\tt useB}'s signature whether any or all of
\begin{enumerate}
\item A pointer to a single B
\item An array of B's
\item A pointer to a single D
\item An array of D's
\item A null pointer
\end{enumerate}
\noindent 
are legal arguments to {\tt useB}. As it turns out, only an array of
B's is really supported: Calls (1) and (3) are unsupported because
{\tt useB} attempts to treat the scalars {\tt single\_b} and {\tt
single\_d} as arrays, and call (4) is unsupported because C++ arrays
are necessarily {\em homogeneous} (i.e., filled with elements all of
exactly the same class), requiring that the declared type ({\tt B}) be
treated as an {\em exact} type when the compiler determines the byte
offset (via {\tt sizeof(B)}) of the element {\tt p[1]}. Since
{\tt sizeof(B) != sizeof(D)}, the computed offset is incorrect. Call
(5) is unsupported since {\tt useB} does not check whether the pointer
is null.

Thus calls (1), (3), (4) and (5) are type-legal but wrong.  There ought to
be a way for programmers to express such intentions.  This is a more
serious type safety compromise in C++ than in C, since ambiguities due
to subclassing are not present in C.

\subsection{Example 2}
\samepage{\begin{verbatim}
class intVec
{
  int  sz;  int* data;
public:
       intVec(int s)      :sz(s), data(new int[sz]) {}
      ~intVec()           { delete [] data; }
  int& operator[] (int i) { return data[i]; }
  int  length()           { return sz; }
};
\end{verbatim}}

\samepage{\begin{verbatim}
void clear(intVec& v) 
{ 
  for (int i = 0; i < v.length(); ++i) v[i] = 0; 
}
\end{verbatim}}

Any C++ compiler attempting to perform standard optimizations on {\tt
clear} will encounter a fundamental obstacle. Even though {\tt
length()} is inlined, because {\tt v.data} is allowed to point to {\em
any} int, a compiler must conservatively assume that {\tt v.data} {\em
could} point to {\tt v.sz}, which could in turn be set to zero within
the assignment statement. This means that {\tt v.sz} must be retrieved
from memory on each iteration rather than being kept in a register, or
more importantly, being used as a loop constant that would allow this
code to be transformed into the essentially optimal C-like
\samepage{\begin{verbatim}
{
  int* p = v.data; 
  int* fence = &v.data[v.sz]; 
  while (p != fence) *p++ = 0;
}
\end{verbatim}}
\noindent
or even a vectorization of this loop on a machine supporting such
things.

One basic efficiency difference between C and C++ programming
is that C programmers who are interested in guaranteeing efficient
performance can always manually rewrite functions like {\tt clear} in
such a manner themselves.  However, because of data encapsulation, C++
programmers cannot obtain low-level data access. Since only compilers
are allowed to (safely) break encapsulation, programmers must rely on
compilers to optimize functions much more than C programmers do. Thus,
optimizability issues will continue to play a far greater role in C++
than they have in C.

Programmers and library authors creating utility classes like {\tt
intVec} will discover surprising and intractable efficiency losses
unless these issues are addressed. This is an increasingly important
issue for some users. Programmers will not always heed the standard
advice (e.g., ARM, p 137) to create and use array-like classes instead
of raw C constructs if doing so yields noticeably slower code.

It is easy to concoct larger examples where this kind of problem
causes much more devastating optimization failures.  This is not the
sort of issue that can {\em always} be solved by merely creating
cleverer compilers that attempt to discern for themselves whether a
variable is serving as a pointer to a scalar versus an array.
Moreover, even if compilers were able to handle the majority of
such cases via advanced dataflow techniques, the programmer type
safety issues remain: It is useful for the {\tt intVec} author to make
crystal clear to other programmers, not just the compiler, that {\tt
data} must point to an array, not a scalar.

Programmers who desire greater type safety and optimizability
need to have a mechanism to express the necessary constraints in a
manner that does {\em not} require any other changes to C++, so that
existing code continues to be accepted.

\section{A Solution}

Perhaps surprisingly, C++ already possesses the basic construct needed
to solve many of these problems, the array reference.  

It is legal to declare references to arrays in either of two forms:
\begin{itemize}
\item ``Open array references'', which do not specify array capacity,
    denoted via {\tt T (\&open\_ref)[]}
\item ``Closed array references'', which do specify capacity,
    denoted via {\tt T (\&closed\_ref)[N]}, where {\tt N} is
    a {\em compile-time} constant.
\end{itemize}
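
For concreteness, the two forms might be declared as follows (a
minimal sketch; {\tt buf} is an arbitrary array introduced only for
illustration, and the bindings assume the matching rules described
below):
\samepage{\begin{verbatim}
  int buf[100];
  int (&open_ref)[]      = buf;  // capacity left unspecified
  int (&closed_ref)[100] = buf;  // capacity fixed at compile time
\end{verbatim}}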

The awkwardness of the syntax for declaring and using array references
may have contributed to the fact that their basic semantics and
properties have apparently never been fully spelled
out.\footnote{Readers of the present proposal may find many syntactic
constructs difficult to grasp immediately. My only defense is that I
did not design this syntax.}

The present proposal examines these issues and proposes answers in ways
that maximally improve both type safety and optimizability, while also
maintaining complete backward compatibility.  If these issues are settled,
perhaps programmers will become accustomed to creating simple {\tt
typedefs} that eliminate some of this awkwardness.
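
For example, a typedef along the following lines could hide most of
the declaration syntax (a sketch only; the names {\tt intArr} and {\tt
clear\_all} are arbitrary):
\samepage{\begin{verbatim}
typedef int intArr[];             // an open array type
void clear_all(intArr& a, int n)  // i.e., void clear_all(int (&a)[], int n)
{ for (int i = 0; i < n; ++i) a[i] = 0; }
\end{verbatim}}
\noindent
With such a typedef, {\tt intArr\&} reads like any other reference type.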

This proposal does not go so far as to suggest particular wording
changes of particular sections of the draft ARM. Instead, the basic
issues are surveyed, and precise rules and their consequences are
discussed. It should be a relatively easy matter to integrate these
into the corresponding sections of the standard.

In fact, with only a few (indicated) exceptions, there is nothing in
this proposal that directly contradicts the ARM. Thus, it may be
viewed as merely an interpretation of the utility of array references
and their implementation.

In other words, regardless of agreement on the details of this
proposal, the semantics of array references probably need to be
addressed by X3J16, since (1) they are not otherwise completely
defined in the ARM, (2) current C++ compilers cannot be depended on to
provide any consistent semantics for this construct, and (3) if, say,
only one C++ compiler were to implement the interpretation discussed
here, and programmers were to regularly exploit this support, serious
incompatibilities could result.

\section{Representational Properties}

In C++, a reference is merely an ``alias''. As discussed, for example,
in the ARM, p 153, declaration of a reference does not automatically
entail the creation of a behind-the-scenes pointer requiring
dereferencing. In fact array references {\em never} require extra
indirection in order to access the base of the referenced array. Any
{\tt T (\&)[]} or {\tt T (\&)[N]} may be internally represented in
precisely the same fashion as a {\tt T param[]} or {\tt T*}, i.e.,
via a pointer to the first cell of the array.\footnote{While it is not
clear to me whether X3J16 can mandate this kind of economical
representation of array references, an important aspect of the present
proposal is that the advantages of array references may be obtained
without additional overhead over other forms.}

For example, all three of the following might generate precisely the
same machine code:
\samepage{\begin{verbatim}
void strcpy1(char* s, const char* t)
{ for (int i = 0; t[i] != 0; ++i) s[i] = t[i]; }
\end{verbatim}}
\samepage{\begin{verbatim}
void strcpy2(char s[], const char t[]) 
{ for (int i = 0; t[i] != 0; ++i) s[i] = t[i]; }
\end{verbatim}}
\samepage{\begin{verbatim}
void strcpy3(char (&s)[], const char (&t)[])
{ for (int i = 0; t[i] != 0; ++i) s[i] = t[i]; }
\end{verbatim}}

The only programmer-visible difference is type safety. For example, it
should be illegal to write {\tt char c; strcpy3(\&c, \&c);}.

While (as discussed below) {\tt T param[]} and {\tt T* p} are often
used interchangeably, the otherwise parallel constructs {\tt T
(\&aref)[]} and {\tt T* (\&ptr)} are representationally distinct,
and thus require stronger type enforcement.
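
The distinction can be illustrated as follows (a sketch; the names are
arbitrary, and the open array reference assumes the interpretation
proposed here):
\samepage{\begin{verbatim}
  char buf[80];
  char* p = buf;
  char  (&aref)[] = buf;   // represented as a pointer to buf[0]
  char* (&pref)   = p;     // must refer to the pointer variable p itself
\end{verbatim}}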

\section{First Class Status}

Array references share many similarities with the C {\tt T param[]}
construct (i.e., an array declared as a function parameter, as in {\tt
strcpy2}). They differ in two important respects: While {\tt T
param[]} is only legal in parameter declarations (and a few
other miscellaneous spots), and can almost always be used
interchangeably with {\tt T*}, array references are legal wherever any
kind of reference is legal, and cannot be interpreted as equivalent to
{\tt T*}.
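
For example, under this interpretation an array reference could appear
as an ordinary alias, a parameter, a return type, or a class member
(an illustrative sketch; all names are arbitrary):
\samepage{\begin{verbatim}
  int buf[10];
  int (&alias)[10] = buf;                // an alias, like any reference
  int (&pick(int (&a)[10]))[10]          // parameter and return type
    { return a; }
  struct Holder { int (&member)[10]; };  // a class member
\end{verbatim}}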

Use of array references clears up some problems encountered in the
above examples:

\subsection{Example 1, revisited}
\samepage{\begin{verbatim}
void useB2(B (&a)[]) { int i = a[1].f() + a[1].bvar; }
\end{verbatim}}
\samepage{\begin{verbatim}
main()
{
  B single_b;        useB2(&single_b);  // (1) ERROR
  B array_of_b[10];  useB2(array_of_b); // (2)
  D single_d;        useB2(&single_d);  // (3) ERROR
  D array_of_d[10];  useB2(array_of_d); // (4) ERROR
  B* bp = 0;         useB2(bp);         // (5) ERROR
}
\end{verbatim}}

Here, the calls that were previously legal but unsafe would be caught
at compile time, assuming implementation of the declaration and
matching rules to be described below.

Additionally, an interesting minor optimization can be performed.
Since C++ arrays are necessarily homogeneous, the virtual call {\tt
a[1].f()} can be resolved at compile time: it must be B's version of
{\tt f()} that is called, and no other.
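
That is, within {\tt useB2} the compiler would be free to treat the
call as if it had been written with explicit qualification (a sketch
of the equivalent code, not a required implementation):
\samepage{\begin{verbatim}
void useB2_resolved(B (&a)[])
{ int i = a[1].B::f() + a[1].bvar; }  // direct, non-virtual call
\end{verbatim}}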

\subsection{Example 2, revisited}

\samepage{\begin{verbatim}
class intVec
{
  int  sz;  int  (&data)[];
public:
       intVec(int s)      :sz(s), data((int(&)[])(new int[sz])){}
      ~intVec()           { delete [] &(data[0]); }
  int& operator[] (int i) { return data[i]; }
  int  length()           { return sz; }
};
\end{verbatim}}

\samepage{\begin{verbatim}
void clear(intVec& v) 
{ 
  for (int i = 0; i < v.length(); ++i) v[i] = 0; 
}
\end{verbatim}}

Since {\tt v.data} must {\em not} refer to a scalar, the optimization
otherwise missed inside of {\tt clear} can be handled.

(The odd mechanics of integrating array references with the {\tt new}
and {\tt delete} operators are addressed below.)

\section{Declaration and type-matching issues}

Examples have so far only used open array references. However, closed
array references are also occasionally useful. In fact, the two
constructs differ enough that separate sets of rules must be created
to handle them.

By definition, an open array reference can serve as an alias for an
array with any number of elements, with the actual number of elements
often unknown to the programmer. Declaration of a closed array
reference represents a stronger commitment on the part of the
programmer. Since, to remain backward compatible, closed array
references must only use compile-time constants for array bounds, they
are less often usable.\footnote{Their utility could be marginally
improved by extending the typesafe C++ {\em conformance} type matching
rules to array references, so that a reference to, say, 100 ints could
be attached to an array of 200 ints. Unfortunately, this strategy
conflicts with the assumptions of C matrix operations.}

Since the little-used {\tt T param[]} construct represents a weaker
version of array references, one could create rules to enable
transformation from one form to the other in those cases where it is
unambiguously safe.  However, because the {\tt T param[]} construct
cannot currently be depended on to be considered distinct from {\tt T*}
(C and C++ compilers traditionally treat them as synonyms), a simpler
strategy is to formalize this type equivalence (as is implicit in the
ARM, p 138, as well as in ANSI C). This then allows simpler rules for
dealing with array references, as opposed to either construct.

\subsection{Rules}

Given these considerations, the necessary rules for declaring array
references should be clear:

An open array reference, {\tt T (\&r)[]} may alias 
\begin{itemize}
\item any array object {\tt T a[N]}, or
\item an array object referenced by a closed array reference, 
    {\tt T (\&cr)[N]}, or
\item an array object referenced by another {\tt T (\&or)[]};
\end{itemize}
\noindent
where {\tt T} must be an {\em exact} type match, not a subclass,
because of C++ homogeneity restrictions on arrays.

A closed array reference, {\tt T (\&r)[N]} may alias
\begin{itemize}
\item an array object {\tt T a[N]}, or
\item an array object referenced by another {\tt T (\&cr)[N]};
\end{itemize}
\noindent
again with the exactness restriction.
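
Using the classes {\tt B} and {\tt D} from Example 1, these rules would
accept and reject initializations roughly as follows (a sketch,
assuming the matching rules just listed):
\samepage{\begin{verbatim}
  B b_arr[10];
  D d_arr[10];
  B (&open1)[]    = b_arr;   // OK: open reference to an array of B
  B (&closed)[10] = b_arr;   // OK: capacities match exactly
  B (&open2)[]    = closed;  // OK: open may alias a closed reference
  B (&open3)[]    = d_arr;   // ERROR: D is not an exact match for B
  B (&short1)[5]  = b_arr;   // ERROR: capacity mismatch
\end{verbatim}}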

No provision is made for implicitly coercing an open to a closed array
reference.  As is always the case, explicit coercions are available,
and might even be useful to those who are aware of their inherent
dangers.  (Casts and coercions are, of course, dangerous, because
compilers are generally allowed to {\em believe} in the veracity of
declared types, which might lead to transformations and optimizations
that fail to make sense for a casted object.)

No additional rules are required to support multidimensional arrays or
matrices.  None of these rules are inconsistent with the ARM. Indeed,
some are already implicit in other definitions and rules in the ARM,
and appear consistent with at least some of the observed behavior
of some C++ compilers.

\subsection{Relation to {\tt T*}}

C requires that a {\tt T*} pointer may be implicitly coerced into a
{\tt T []}, and vice versa. But there is every reason to resist 
implicit coercions between {\tt T*} and array references. 


In passing, it may be noted that coercions may be used in order
to create references to subarrays:

\samepage{\begin{verbatim}
  T a[200];
  T (&suba)[50] = (T (&)[50])(&a[150]);
\end{verbatim}}

\subsection{Overloading}

Overloading sensitivity and resolution rules, and type matching
generally, should be governed by the above requirements. For example,
the function {\tt void f(int (\&a)[])} would be distinguishable from
{\tt void f(int* p)} (which is, according to the ARM, p 138, the same
function as {\tt void f(int param[])}).

While sound, this has a surprising effect:
\samepage{\vspace{-0.5pt}\begin{verbatim}
  extern void f(int*);
  extern void f(int (&)[]);
  int x[100];  
  f(x);                 // ambiguous
  int (&rx)[100] = x; 
  f(rx);                // call array reference version
\end{verbatim}}

Unless changes in the overload resolution policies are made,
programmers will have to refrain from overloading pointer versions of
functions in programs using array references.

Such issues have interesting consequences for C++ library
standardization.  For example, should the C library version of {\tt
strcpy} have a C++ prototype along the lines of {\tt strcpy1} or of
{\tt strcpy3} above?

\section{Assignment}

In C++, reference assignment (as opposed to initialization) always
entails {\em deep copy} semantics, which, in the case of array references,
implies the need for array copies. However, this conflicts with
the set of C rules that prevent array copies from ever automatically
occurring, even in those cases where the bounds of the copy are
determinable at compile time.

Hence, the most conservative strategy is to simply make array
reference assignments illegal, in the same manner that array
assignments are illegal. For example,
\samepage{\vspace{-0.5pt}\begin{verbatim}
  T x[100];   T (&rx)[100] = x; 
  T y[100];   T (&ry)[100] = y;
  x  = y;        // ERROR
  rx = y;        // ERROR
  x  = ry;       // ERROR  
\end{verbatim}}

\section{Array Pointers}

Array references fail to clear up safety and efficiency
issues in those cases where pointers are required instead of
references.

Just as it is legal to define a reference to an array, it is legal
(and again, awkward) to define a pointer to an array via
\begin{itemize}
\item ``Open array pointers'', which do not specify array capacity,
    denoted via {\tt T (*open\_ptr)[]}
\item ``Closed array pointers'', which do specify capacity,
    denoted via {\tt T (*closed\_ptr)[N]}, where {\tt N} is
    a {\em compile-time} constant.
\end{itemize}

These have the same representational properties as array references:
they do not require additional indirection.

The most important situation in which array pointers are required
is in dealing with allocation and destruction of arrays of objects.
The C++ definitions of the ``vector'' versions of {\tt new} and {\tt
delete} operators accentuate the scalar versus array pointer ambiguity
found in C.  While {\tt vector\_new} actually returns a pointer to an
array of objects ({\tt T (*)[]}), it is typed as returning a simple
pointer ({\tt T*}).  Similarly, the argument for {\tt vector\_delete}
should be a {\tt T (*)[]} but is listed as a {\tt T*}.

This type laxity allows many common programming errors to remain uncaught
by compilers.  For example, in
\samepage{\vspace{-5.0pt}\begin{verbatim}
  B* bp = new D[100]; /* ... */  delete [] bp;
\end{verbatim}}
\noindent
the base ({\tt B}) element size step would be {\em incorrectly} used
to traverse through the array of {\tt D}s while calling destructors, with
indeterminate (but surely unpleasant) run time consequences.

Additionally, current definitions require the kinds of machinations
seen in the {\tt intVec} constructor and destructor in order to
``coerce'' arguments into their logically correct form.

The solution is first to solidify rules for array pointers, and
then to correctly define the types of {\tt vector\_new} and {\tt
vector\_delete}.

\subsection{Rules}

Rules for declaring and initializing array pointers should correspond
exactly to those for references:

\begin{itemize}
\item An open array pointer {\tt T (*p)[]} may point to anything that
    an open array reference {\tt T (\&r)[]} may refer to.
\item A closed array pointer {\tt T (*p)[N]} may point to anything that
    a closed array reference {\tt T (\&r)[N]} may refer to.
\end{itemize}

For example,
\samepage{\vspace{-5.0pt}\begin{verbatim}
  T x[100];
  T y[200];
  T (*p)[] = &x;      // p points to x
  *p = 17;            // ERROR
  (*p)[0] = 17;       // set x[0] = 17
  T (*q)[100] = &x;   // q points to x
  (*q)[1] = 2;        // set x[1] = 2
  T (&r)[] = y;       // r refers to y
  p = &r;             // repoint p to y
  (*p)[2] = 49;       // set y[2] = 49
  q = p;              // ERROR
  p = q;              // OK
\end{verbatim}}

Related issues, including overloading sensitivity and assignment,
are identical to those described above for array references.

\subsection{Relation to {\tt T*}}

While, for lack of existing precedent, there is no need to
specially consider the intercoercibility of array references
and simple pointers, there is a great deal of C++ code that relies
on the representational equivalence and lack of type discernment
between {\tt T*} and {\tt T (*)[]}. 

Requiring that these forms not freely intercoerce is clearly most
defensible. The greatest impact would be on {\tt vector\_new} and {\tt
vector\_delete}. A new set of rules, conflicting only in type details
with the existing ARM (e.g., p 60), is needed, stating that
\begin{itemize}
\item The form {\tt new T[ expression ]} invokes {\tt vector\_new}
    and returns a {\tt T (*)[]}.
\item The form {\tt delete[] arg} requires that {\tt arg} have exact type
    {\tt T (*)[]} and calls {\tt vector\_delete}, returning {\tt void}.
\item {\tt vector\_new} and {\tt vector\_delete} are not overloadable.
\end{itemize}
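
In isolation, allocation and deallocation under these rules might read
as follows (a sketch; {\tt n} stands for an arbitrary run-time size):
\samepage{\begin{verbatim}
  int (*p)[] = new int[n];   // vector_new returns an int (*)[]
  (*p)[0] = 17;              // access through the array pointer
  delete [] p;               // vector_delete requires an int (*)[]
\end{verbatim}}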

For example, given this, the {\tt intVec} constructor and
destructor could be rewritten intelligibly, in a manner perfectly
analogous to those allocating and freeing scalars:
\samepage{\begin{verbatim}
class intVec
{
//...
       intVec(int s)     :sz(s), data(*(new int[sz])) {}
      ~intVec()          { delete [] &data; }
//...
};
\end{verbatim}}

These changes would invalidate existing code of the form
\samepage{\vspace{-5.0pt}\begin{verbatim}
  T* p = new T[100]; /* ... */  delete [] p;
\end{verbatim}}
\noindent
There are two possible ways to deal with this:
\begin{enumerate}
\item Phase in the changes.
\item Allow {\tt T*} and {\tt T(*)[]} to silently, ``trivially'' intercoerce.
\end{enumerate}

Choosing the second option negates {\em all} type safety advantages.

\section{Homogeneity and type safety}

As discussed in my (mainly unrelated) ``Customization in C++'' paper
(1990 USENIX C++ conference), it is difficult in C++ to express the
fact that a function is {\em intentionally} type-specific, properly
working for a particular class, but not necessarily for subclasses.
As noted above, this is among the issues successfully addressed
by the use of array references in the special case of functions
requiring homogeneous array arguments.

The most common and important examples are ``homogeneous object
container classes'' like simple stacks, queues, sets, lists, etc.,
in which {\em copies} of objects (created via {\tt X(const X\&)} or
{\tt X::operator = (const X\&)}) passed by reference into an insertion
member function are placed inside a container. Such methods are
type-specific, since exact versions of the copy constructor and/or
assignment operator must be hard-wired into implementations. It is
almost always the class author's intention to disallow {\tt sub-T}
arguments, in order to prevent surprising and undesirable ``chopped
copies'' and unintentional insertion of objects without exact type
{\tt T}.

Such containers bear more than a passing similarity to C arrays.  For
both efficient implementability and type safety, these containers are
strictly homogeneous, like C arrays.  Array references can provide a
modicum of help for expressing this correspondence. Since arrays are
necessarily homogeneous, arguments for container methods can be
declared as arrays of single elements. For example, a simple Stack
could be declared via

\samepage{\vspace{-5.0pt}\begin{verbatim}
class TStack100 
{
private:
  int  sp;   T data[100];
public:
  void push(const T (&a)[1]) { data[sp++] = a[0]; }
  //...
};
\end{verbatim}}
\noindent
so as to disallow the use of {\tt sub-T}'s as arguments to {\tt push}.

Note that, generally, this technique extends the kinds of optimization
opportunities discussed above. Use of exact types in these contexts,
as enabled by homogeneity restrictions, means that virtual function
calls can be statically resolved.

Of course, {\em using} such a class would not be especially simple or
convenient, since arguments for methods like {\tt push} would either
need to be framed as singleton arrays, or explicitly coerced into
arrays, which, if done routinely via macros or the like, ends up
defeating the safety goals. 
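
For instance, a client might be forced to write something like the
following (illustrative only; the names {\tt s}, {\tt elem}, and {\tt
t} are arbitrary):
\samepage{\begin{verbatim}
  TStack100 s;
  T elem[1];                  // frame the argument as a singleton array
  // ... set elem[0] ...
  s.push(elem);
  T t;
  s.push((const T (&)[1])t);  // or coerce explicitly, weakening safety
\end{verbatim}}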

However, there is a way to accommodate simpler client usage via an ad
hoc special case rule.  Recall that allowing {\em implicit} coercions
from scalar to array references and vice versa negates the
optimizability goals discussed with respect to the {\tt intVec}
example. However, if intercoercibility were allowed only for the special
case of {\em singleton} closed array references, both goals seem
pragmatically attainable. That is, the declaration and
matching rules could be extended to include:

\begin{itemize}
\item A {\tt T (\& singleton)[1]} may alias any object of exact
    type {\tt T}.
\end{itemize}
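
Under such a rule, the same client code could be written naturally
(a sketch):
\samepage{\begin{verbatim}
  TStack100 s;
  T t;
  s.push(t);   // t treated as a T (&)[1]; sub-T arguments still rejected
\end{verbatim}}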

Overloading sensitivity and dispatch rules would have to be modified
accordingly.  

This is an uncomfortable proposal because it yet further complicates
matching, coercion, and overloading rules, which are difficult
enough as it is. However, the resulting type safety enhancements
seem well worth the trouble.

An alternative proposal, which obtains the same effect without taking
a backdoor approach, is briefly discussed in my ``customization''
paper.

\section{Epilogue}

The primary limitation of array references is that they, like any
other references, are not ``reseatable'': C++ references may only alias
one object throughout their lifetimes. This makes the use of array
references impossible in common cases involving resizing,
copy-on-write, delayed allocation, sharing, etc., of referred-to
arrays.

Surely others have proposed to X3J16 that this restriction be removed.
I would like to go on record as supporting any such change.  Among the
simplest possible mechanisms for doing so is to define a
non-overloadable built-in polymorphic pseudofunction (i.e., with the
same status as {\tt sizeof}), {\tt void reseat(ANY\& r, ANY\& newobj)},
that makes {\tt r} refer to {\tt newobj}. Any other construct with the
same effect would suffice. Of course, {\tt const} references should not
be reseatable.
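
Usage might look like the following (a sketch; {\tt reseat} is the
hypothetical pseudofunction suggested above):
\samepage{\begin{verbatim}
  int a[100];  int b[100];
  int (&r)[] = a;   // r initially refers to a
  // ...
  reseat(r, b);     // r now refers to b
\end{verbatim}}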

Allowing reseatable references would contribute to the battle to
create safer, more optimizable subsets of C++. The utility of
references in terms of type precision, inherent promises against
performing pointer arithmetic, and simplicity of use should not
be wasted because of quibbles about the most appropriate means
for expressing things like reseating.

\end{document}
--
Doug Lea  dl@g.oswego.edu || dl@cat.syr.edu || (315)341-2688 || (315)443-1060
|| Computer Science Department, SUNY Oswego, Oswego, NY 13126 
|| Software Engineering Lab, NY CASE Center, Syracuse Univ., Syracuse NY 13244