[net.lang.c] Diatribe on uninitialized externs

kpmartin@watmath.UUCP (Kevin Martin) (10/25/84)

The following article refers to the entire C environment, including the
compiler, the linker, and the operating system.

There seem to be four alternatives for what to do with externs and statics
which are not explicitly initialized:

1) Have their value be undefined (i.e. garbage).
   Disadvantages:
      Breaks many current programs. It could be argued that well-written
            programs (as opposed to 'correct' programs) would not be broken,
            since a well-written program initializes variables explicitly
            if it cares about the initial value.
      Arrays of unknown size become effectively impossible to
            initialize at all (see note 1).
   Advantages:
      Consistent behaviour with autos and malloc'ed space
      Consistent with normal reason (i.e. the variable contains
            a predictable value ONLY IF it has been initialized in the
            C source).
      Tends to encourage easy-to-read code: the reader can tell (or
            *should* be able to tell, if coded cleanly) if there
            is initialization *code* somewhere. e.g. you are sure that in
               int x;
               int y = 5;
            there is initialization code (somewhere) for 'x' but not for 'y'.
      Makes object and a.out files smaller, thus program load time is
            also reduced (notes 2 and 4).
      Allows the programmer to get genuine "bss" (un-initialized) space.
            This becomes especially important if overlays are being used,
            since it may be desired that an overlay be loaded without re-
            initializing all the variables it contains (note 4).

2) Have their value be the 0 bit pattern.
   Disadvantages:
      Programs which don't explicitly initialize their pointers and
            floats would not port to any more machines than they currently
            do (note 3)
      Arrays of unknown size containing floats, doubles or pointers
            cannot be initialized (note 1).
   Advantages:
      This is the current method (i.e. inertia reigns)
      Makes object and a.out files smaller, thus program load time is
            also reduced (note 4).

3) Have their value set to a zero of the appropriate type.
   Disadvantages:
      Requires a somewhat arbitrary rule on "what is the appropriate type
            for a union?"
      Generates larger object files, etc (note 4).
      The programmer cannot signal to the reader that a variable is
            deliberately being left un-initialized.
      Arrays of unknown size cannot be initialized if they contain
            non-zero values.
   Advantages:
      Allows old code to be ported to new machines (note 3).

4) A combination of (1) and (2):  Un-initialized variables start off as
   zero in the first overlay that is loaded. Subsequent overlays get whatever
   was left in the storage location by previous overlays.
   Disadvantages:
      Same as for (1), except that existing programs are not broken.
   Advantages:
      Same as (1), except that sloppy coding has a better chance of
            running.

Note 1:
   By "array of unknown size", I mean, for example, an array whose size
is a #define'd constant. There is currently no method of giving explicit
initializers to such an array in its entirety, unless the source file is
heavily modified each time the #define'd constant is changed.
   Note that the improved CPP facilities (#eval and genuine macros) which
I described in an earlier article would allow such arrays to be initialized
to *any* value (not just zero bit pattern or zero of the appropriate type),
thus making the variations on this disadvantage go poof.
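
   A minimal sketch of the problem in Note 1, assuming a hypothetical
array `table` whose size is the #define'd constant NELEM and whose
elements should all start at -1. No initializer list can follow NELEM
as it changes, so the usual workaround is initialization code that runs
before first use. The names `table`, `fill_table` and `table_at` are
illustrative only.

```c
#include <stddef.h>

#define NELEM 8                 /* hypothetical size; may be changed freely */

/* No explicit initializer: an initializer list cannot track NELEM. */
static int table[NELEM];

/* Initialization *code*: must be called before table is first used. */
void fill_table(void)
{
    size_t i;

    for (i = 0; i < NELEM; i++)
        table[i] = -1;
}

/* Accessor so callers can inspect an element. */
int table_at(size_t i)
{
    return table[i];
}
```

   Changing NELEM needs no other edit to the source, which is the
property a written-out initializer list cannot offer.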

Note 2:
   Since most systems clear the memory before a program is loaded, for
security purposes, method (1) often flukes out to be method (2).

Note 3:
   If the purpose of the standard does not include porting existing (old)
programs to new C implementations on "hostile" hardware, this advantage/
disadvantage does not exist. I believe that it is the case that the new
standard should allow NEW programs to be written portably, and that old
programs continue to work, but *only on machines on which they already work*.

Note 4:
   These features (reduced object or a.out size, and overlays) may or may
not exist on any particular system, and they may be non-issues to many
users (because they have lots of disk space, or they think overlays are for
the birds). However, these features *do* exist on some systems, and the
users *do* find them useful, and it would be desirable that the standard
*not* be written such that a compiler has to be non-conforming to take
advantage of such features.


If overlays are going to be ignored, (2) and (4) are equivalent.

Ignoring the problems of upward compatibility and lazy programming
styles, choice (1) is the winner. However, given that old
programs must continue to work, Choice (4) looks like the best one.

The only bad problem with (4) is that of array initialization. As mentioned
above, this can be solved much more generally with an improved CPP.
This standard will probably not include such features, or a method of choosing
which union member to initialize. But there will be more C standards down
the road, and these features may appear, making (1), (2) or (4) the clear
winning choices.
If the committee goes for choice (3) now, this will only encourage code
which doesn't explicitly initialize things, and make for an even larger
base of software to break when the next standard tries to go back to
choice (1) or (2).

I consider (4) with improved CPP to be the long-range goal, and the
implementation of (3) in the current standard prevents changing to (4)
in the next standard.
We can either let it sit as is for now, and fix it properly when the
facilities become available, or we can (for the feeble reason of
porting old shit code to new machines) paint ourselves into yet another
corner by fixing it poorly immediately.
                       Kevin Martin, UofW Software Development Group

henry@utzoo.UUCP (Henry Spencer) (10/25/84)

It is definitely much too late to remove default initialization from C;
far, far too much code depends on it, including the Unix kernel.  Adding
features is one thing.  Taking them away is another.

Note that "not breaking existing correct code" is a major objective of
the ANSI committee.  This means that default initialization must be
present, and must follow either the zero-bit-pattern or as-if-integer-0
rule.  The oddball machines are the only ones that pay a penalty for the
wrong choice, so it comes down to a simple question of whether the
people with such machines would rather maximize portability of older
software to their machines, or maximize the efficiency (object-module
size and load time) of new code.  This is clearly a case where those
of us with un-oddball machines should keep quiet; it is presumptuous of
us to tell the oddball-machine people "we know what you ought to want".
I suspect that really odd machines may end up with a compiler option to
settle the matter; probably the default setting should be "portable".

If overlays are being used mostly to get more code into a limited space,
then clearly they should not affect the data.  Such overlays are logically
just an implementation technique for fitting lots of code into a small
space, and (ideally) should not be visible on the language level at all.
If it is specifically desired that overlays overlay the data space as
well (i.e. act like exec()), then there's no problem.  If what you have
is something in between, then I think the only practical answer is that
your techniques and the problems associated with them are implementation-
dependent and are not a standards issue.  Depending on what sort of overlays
you have, the data gets left alone, or trashed, or re-initialized wholly
or partly; I see no reason for an ANSI standard to try to bless one kind
of overlays and condemn the others.

> If the committee goes for [integer-0] now, this will only encourage code
> which doesn't explicitly initialize things, and make for an even larger
> base of software to break when the next standard tries to go back to
> [no-default-initialization] or [zero-bit-pattern].

Why in the world would the next standard do anything so stupid?  You are
setting up a straw man.  Of course there will be hell to pay if the next
standard goes out of its way to be incompatible with the current one, but
that's true regardless.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) (10/29/84)

I have omitted much of the original text to keep the size down:

> There seem to be four alternatives for what to do with externs and statics
> which are not explicitly initialized:
> 
> 1) Have their value be undefined (i.e. garbage).
>    Disadvantages:
>       Breaks many current programs.

Very important from an economic standpoint.

>    Advantages:
>       Allows the programmer to get genuine "bss" (un-initialized) space.
>             This becomes especially important if overlays are being used,
>             since it may be desired that an overlay be loaded without re-
>             initializing all the variables it contains (note 4).

Any overlay system that reinitializes variables is WRONG.

> 2) Have their value be the 0 bit pattern.

As has been pointed out in another discussion, the 0 bit pattern may
not be appropriate for pointers on some machines.

>    Disadvantages:
>       Arrays of unknown size containing floats, doubles or pointers
>             cannot be initialized (note 1).

If all elements are to be initialized to the same value, then this
statement is false.  The last explicit initializer is repeated as necessary
to fill out the array.

> 3) Have their value set to a zero of the appropriate type.
>    Disadvantages:
>       Requires a somewhat arbitrary rule on "what is the appropriate type
>             for a union?"

Easy; the type of the first member.

>       The programmer cannot signal to the reader that a variable is
>             deliberately being left un-initialized.

Sure he can.  Nothing prevents you from specifying initializers you care
about and letting the rest default.  I do this anyway.

>       Arrays of unknown size cannot be initialized if they contain
>             non-zero values.

See above.

> 4) A combination of (1) and (2):  Un-initialized variables start off as
>    zero in the first overlay that is loaded. Subsequent overlays get whatever
>    was left in the storage location by previous overlays.

The C language says nothing about overlays.  This is an implementation
issue that must be addressed by the overlay system designer, but it
does not belong in the language specification.

> Ignoring the problems of upward compatibility and lazy programming
> styles, choice (1) is the winner. However, given that old
> programs must continue to work, Choice (4) looks like the best one.

> If the committee goes for choice (3) now, this will only encourage code
> which doesn't explicitly initialize things, and make for an even larger
> base of software to break when the next standard tries to go back to
> choice (1) or (2).

I vote for choice (3).  I don't see that arrays or overlays have anything
to do with the choice among these alternatives.  (3) is cleanest.

kpmartin@watmath.UUCP (Kevin Martin) (10/30/84)

>>I wrote:
>Doug Gwyn writes:
>> There seem to be four alternatives for what to do with externs and statics
>> which are not explicitly initialized:
>> 
>> 1) Have their value be undefined (i.e. garbage).
>>    Disadvantages:
>>       Breaks many current programs.
>
>Very important from an economic standpoint.
Yes. That is why I am not advocating this alternative.

>>    Advantages:
>>       Allows the programmer to get genuine "bss" (un-initialized) space.
>>             This becomes especially important if overlays are being used,
>>             since it may be desired that an overlay be loaded without re-
>>             initializing all the variables it contains (note 4).
>Any overlay system that reinitializes variables is WRONG.
Depends what you want. But as I said, I am not advocating this alternative
for C.

>> 2) Have their value be the 0 bit pattern.
>As has been pointed out in another discussion, the 0 bit pattern may
>not be appropriate for pointers on some machines.
Whether it is appropriate for some machines does not affect its
existence as an alternative.
The point I made was to consider whether it is ANSI's goal to support
old code on new machines. I realize there are such machines; I have to
write a C compiler for at least two of them. Complete with non-zero
NULL pointers.
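
A minimal sketch of where alternatives (2) and (3) part ways on such
machines. All names here are hypothetical. The explicitly initialized
pointer is null everywhere; the pointer with no initializer is null
under rule (3), but under rule (2) only where a null pointer happens to
be the zero bit pattern, which `null_is_zero_bits` checks for the
machine at hand.

```c
#include <stddef.h>

static char *by_default;          /* no explicit initializer */
static char *by_name = NULL;      /* explicit null pointer */

/* Returns 1 if the n bytes at p are all zero, 0 otherwise. */
int all_bits_zero(const void *p, size_t n)
{
    const unsigned char *b = p;
    size_t i;

    for (i = 0; i < n; i++)
        if (b[i] != 0)
            return 0;
    return 1;
}

/* Portable on every machine: the initializer was written out. */
int explicit_is_null(void)
{
    return by_name == NULL;
}

/* Rule (3) guarantees 1 here; rule (2) guarantees it only where
   a null pointer is represented by all-zero bits. */
int default_is_null(void)
{
    return by_default == NULL;
}

/* Returns 1 if this machine represents a null pointer as zero bits. */
int null_is_zero_bits(void)
{
    char *p = NULL;

    return all_bits_zero(&p, sizeof p);
}
```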

>>    Disadvantages:
>>       Arrays of unknown size containing floats, doubles or pointers
>>             cannot be initialized (note 1).
>If all elements are to be initialized to the same value, then this
>statement is false.  The last explicit initializer is repeated as necessary
>to fill out the array.
Oh yeah? Since when? K&R appendix A section 8.6 paragraph #5 says missing
items get zeroed.

>> 3) Have their value set to a zero of the appropriate type.
>>    Disadvantages:
>>       Requires a somewhat arbitrary rule on "what is the appropriate type
>>             for a union?"
>Easy; the type of the first member.
As I said, an arbitrary rule. I didn't say it would be difficult to come up
with one. Besides, who says I always want the same variant of the union
initialized? What if one array uses the 'int' element (so I want int zeros),
and another array uses the pointer element (so I want NULL's)?
>
>>       The programmer cannot signal to the reader that a variable is
>>             deliberately being left un-initialized.
>Sure he can.  Nothing prevents you from specifying initializers you care
>about and letting the rest default.  I do this anyway.
If you do this anyway, why do you care about this particular discussion?
That's fine & dandy if I can be sure that whoever wrote the code was as
diligent as you and I are about initializing things we really care about
(and drank enough coffee that day, etc.)
I agree, he can signal deliberately *initialized* ones. But given a declaration
like
	int x;
I can't tell if he really cares that there is a zero there, if he forgot
to initialize it, or if he doesn't care what is in it. This is what I
mean by deliberate *lack of initialization*. It is (partly, at least)
a statement of my opinion (and yours, judging from your comment above) that
a programmer should always use explicit initializers if the initial value
matters. If that practice were followed by everyone, this discussion
would hardly be as exciting. Unfortunately, it isn't always possible,
if you have a union or an array.

>>       Arrays of unknown size cannot be initialized if they contain
>>             non-zero values.
>See above.
Yes, please do.
>
>> 4) A combination of (1) and (2):  Un-initialized variables start off as
>>    zero in the first overlay that is loaded. Subsequent overlays get whatever
>>    was left in the storage location by previous overlays.
>The C language says nothing about overlays.  This is an implementation
>issue that must be addressed by the overlay system designer, but it
>does not belong in the language specification.
The language should not make the "overlay system designer"'s job impossible.
The language spec doesn't even have to mention overlays. It merely has to
say what happens to un-initialized variables.
>
>> Ignoring the problems of upward compatibility and lazy programming
>> styles, choice (1) is the winner. However, given that old
>> programs must continue to work, Choice (4) looks like the best one.
>
>> If the committee goes for choice (3) now, this will only encourage code
>> which doesn't explicitly initialize things, and make for an even larger
>> base of software to break when the next standard tries to go back to
>> choice (1) or (2).
>
>I vote for choice (3).  I don't see that arrays or overlays have anything
>to do with the choice among these alternatives.  (3) is cleanest.
As far as 'cleanliness', I disagree, but I admit that this is just an
opinion.

What I was trying to state is that there are two ways to solve the problem
of hostile machines (without them, this problem wouldn't exist).
One method is better CPP; it requires extensions to CPP, and the programmer
   will have to type (as in keyboard, not as in typedef) more, but this
   approach is far more powerful (for going *beyond* a minimal solution
   to the original problem). It even lets you initialize all the rest of
   the array elements to a non-zero value (as you seem to think C can
   already do).
The other method is this implicit zero-of-the-right-type initialization.
   It solves *only the immediate problem* and is otherwise a dead end.

These two solutions are not incompatible, but implementing both of them
(eventually...) will give no advantages over just the CPP solution,
but will have the disadvantages of both.

I doubt that any CPP changes are forthcoming in this standard, but I don't
like stopgap solutions, either.

I would rather see it stay as is for now (zero bit pattern).
Bear in mind that I use such hostile machines in my regular work, and
have encountered no problems with the current rules (other than not
being able to tell deliberate un-initialized from deliberate but implicit
zero, a problem which is not solved by typed-zero initialization).
                  Kevin Martin, UofW Software Development Group

gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) (11/02/84)

> >...  The last explicit initializer is repeated as necessary
> >to fill out the array.
> Oh yeah? Since when? K&R appendix A section 8.6 paragraph #5 says missing
> items get zeroed.

Well, I tried this on a UNIX System V PCC and what Kevin says does
describe its behavior.  I wonder where I got the other idea (which
I suggest is a better rule, but incompatible with current behavior).
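
A minimal sketch of the behavior in question, using a hypothetical
five-element array `a` with only two initializers written out. The
remaining elements are zeroed; the last explicit value is not repeated.

```c
#include <stddef.h>

#define N 5

/* Two explicit initializers; the remaining three elements become 0,
   not copies of the last explicit value 9. */
static int a[N] = { 7, 9 };

/* Accessor so callers can inspect an element. */
int element(size_t i)
{
    return a[i];
}
```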

> ... Besides, who says I always want the same variant of the union
> initialized? What if one array uses the 'int' element (so I want int zeros),
> and another array uses the pointer element (so I want NULL's)?

In many years of C programming I have never had such a requirement.
Unions are pretty much a kludge for things like memory allocation.
I can think of a general way to specify this type of initialization:
	union	{
		int	a;
		double	b;
		char	*c;
		}
	foo =	{
		,
		3.14159,

		};
using explicitly empty members in the initializer (this would apply to
structs as well as unions).  The only incompatibility with current C
that I see here is the slightly different meaning of the final , in
the initializer list.  This solution avoids the ambiguity of using just
a type specifier (which would work for unions but not for structs) and
having to supply explicit member names for initializers (which calls for
a significant change to existing compilers).
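
For comparison, a sketch of the two forms C compilers came to accept
after this discussion, neither of which is the empty-member syntax
proposed above. This assumes a C99 or later compiler; the type name
`value` and the variables are hypothetical. A plain brace initializer
applies to the first member, and a C99 designated initializer names the
member outright.

```c
union value {
    int     a;
    double  b;
    char   *c;
};

/* No designator: the initializer applies to the first member, a. */
static union value first = { 42 };

/* C99 designated initializer: the member is named explicitly. */
static union value named = { .b = 3.14159 };

int first_a(void)
{
    return first.a;
}

double named_b(void)
{
    return named.b;
}
```

Naming the member keeps the order of union members from mattering, at
the cost of the compiler change mentioned above.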

> >>       The programmer cannot signal to the reader that a variable is
> >>             deliberately being left un-initialized.
> >Sure he can.  Nothing prevents you from specifying initializers you care
> >about and letting the rest default.  I do this anyway.
> I agree, he can signal deliberately *initialized* ones. But given a declaration
> like
> 	int x;
> I can't tell if he really cares that there is a zero there, if he forgot
> to initialize it, or if he doesn't care what is in it...

Unless you outlaw
	int x;
altogether, or require that there be an explicit initializer SOMEWHERE
among all the load modules (à la Whitesmiths), you still won't be able
to tell if he wanted the default initialization according to the rules
(assuming a "non-junk" rule), if he doesn't care, or if he forgot.  Using
the same method I suggested above for struct/unions,
	int x = { };
is an explicitly empty initialization showing that the programmer has
thought about the matter and decided that he didn't care what's there.

> The language should not make the "overlay system designer"'s job impossible.
> The language spec doesn't even have to mention overlays. It merely has to
> say what happens to un-initialized variables.

I agree that something definite should be said about the initial contents
of un-initialized variables.  If an overlay system designer finds that
there is no practical way of avoiding clobbering variables (by reloading
their initial values), then he has to give up the idea of transparent
overlay facilities, since it is clear that non-auto/register variables
are intended to retain whatever is stored into them until the program
explicitly stores something else there.  This is perfectly reasonable
and any violation needs to be announced loudly to the user of that
particular overlay system (which should not get in the way when the user
elects NOT to use overlays).  Very few overlay systems (including the one
Ron Natalie and I did for JHU/BRL PDP-11 UNIX) are COMPLETELY transparent
at the source code level, although that is certainly a desirable goal.

> Bear in mind that I use such hostile machines in my regular work, and
> have encountered no problems with the current rules (other than not
> being able to tell deliberate un-initialized from deliberate but implicit
> zero, a problem which is not solved by typed-zero initialization).

If the rule is that uninitialized data is filled with proper-typed zero,
then it seems that you wouldn't have to care which was intended (since
the value of deliberately un-initialized "don't care" storage cannot be
correctly used until it is stored into).  The problems would appear to
be due to trying to follow different rules, for example using specially-
tagged "illegal data" values or "not defined" memory manager traps for
uninitialized data instead of zero.  By the way, I think we should beat
on the hardware designers who keep dreaming up these "helpful" features
without checking with compiler/OS implementers to see what their effects
will be.  If possible, buy more reasonable hardware and TELL the loser
of the competition just what's wrong with his fancy design.

The reason for zero bit pattern is clearly because that is what UNIX
does automatically for "bss" storage.  Not all OSes allow one to use
tricks like this, although the C runtime startoff module could be a
fast loop to initialize "bss" to a zero bit pattern.  Typed zero, though,
in general would have to be initialized by the compiler or by a rather
smart link editor (I can think of some other, incredibly ugly, kludges).
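
A minimal sketch of such a startoff loop, under the assumption that the
bounds of the "bss" region are handed in by the caller. A real startup
module would take them from linker-defined symbols whose names vary
from system to system, so none are named here; `clear_bss` itself is a
hypothetical name.

```c
/*
 * Clear the half-open region [start, end) to a zero bit pattern.
 * The caller supplies the bounds; on a real system they would come
 * from symbols provided by the link editor.
 */
void clear_bss(unsigned char *start, unsigned char *end)
{
    while (start < end)
        *start++ = 0;
}
```

A loop of this kind knows nothing about types, which is why it can
deliver a zero bit pattern but not a typed zero.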

I think I will modify my position:  IF uninitialized data HAS to have
some valid value, then I would (still) recommend 0 of the appropriate
type rather than a 0 bit pattern.  This seems to be compatible with
currently portable C code.  However, if one is willing to drop the
compatibility requirement (apparently the ANSI committee is not), then
I would have uninitialized data contents UNKNOWN, possibly trap-causing,
if they are used before being defined.  That would help stamp out sloppy
coding practices (nothing will completely solve this problem).

kpmartin@watmath.UUCP (Kevin Martin) (11/03/84)

>	foo =	{
>		,
>		3.14159,
>
>		};
>using explicitly empty members in the initializer (this would apply to
>structs as well as unions).  The only incompatibility with current C
>that I see here is the slightly different meaning of the final , in
>the initializer list.
This implies that the ordering of union elements is significant, which
is currently not the case. But the 'zero-according-to-the-type-of-the-
first-union-element' rule also does this.
I would still prefer to name the element when I am explicitly initializing
it, but that is a different discussion.

>If the rule is that uninitialized data is filled with proper-typed zero,
>then it seems that you wouldn't have to care which [don't-care about the
>value vs. wanting a zero] was intended...
When I go to modify code which was written by someone else (or by
myself in the long-forgotten past), I do care. If it was deliberately
left un-initialized, but I don't realize this, and I now give it an
explicit initializer (non-zero) for some purpose, it will have
no effect, because of the initialization code which I failed to
look for. And I can't trust other people to have explicitly initialized
*every* variable they cared about.

>The problems would appear to
>be due to trying to follow different rules, for example using specially-
>tagged "illegal data" values or "not defined" memory manager traps for
>uninitialized data instead of zero.  By the way, I think we should beat
>on the hardware designers who keep dreaming up these "helpful" features
>without checking with compiler/OS implementers to see what their effects
>will be.  If possible, buy more reasonable hardware and TELL the loser
>of the competition just what's wrong with his fancy design.
I always did like simple machines like the Nova... there were so few
ways of doing anything that most of the code was already optimal!
Unfortunately, in the cases I am working with, guess who is paying me
(indirectly). At least I don't have the funny undefined values and tags
you mention.

>
>I think I will modify my position:  IF uninitialized data HAS to have
>some valid value, then I would (still) recommend 0 of the appropriate
>type rather than a 0 bit pattern.  This seems to be compatible with
>currently portable C code.  However, if one is willing to drop the
>compatibility requirement (apparently the ANSI committee is not), then
>I would have uninitialized data contents UNKNOWN, possibly trap-causing,
>if they are used before being defined.  That would help stamp out sloppy
>coding practices (nothing will completely solve this problem).
>Doug Gwyn
That seems to reflect my feelings for the *eventual* resolution of this
question. Once having chosen one of these paths, there will be no backing
out, which is why I am loath to choose either of them in the first standard.

I think we've run out of things to discuss... (I hear the entire net breathe
a sigh of relief)
                       Kevin Martin, UofW Software Development Group

henry@utzoo.UUCP (Henry Spencer) (11/06/84)

> ... Once having chosen one of these paths, there will be no backing
> out, which is why I am loath to choose either of them in the first standard.

I hate to mention this, Kevin, but the first standard was the first
edition of DMR's C Reference Manual.  The ANSI standardization process is
different only in degree.  There is some cause for debate about which
of the two paths should be chosen, but the decision to choose one
of the two was made perhaps a decade ago.  Like C's revolting switch
syntax, it's far too late to reconsider this now.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry