[comp.lang.c++] Strings

dl@g.g.oswego.edu (Doug Lea) (10/24/89)

Two issues involving Strings...

Andy Koenig says...

> > 	a+b = c;
> 
> > appears to be legal. (At least it compiled under 1.2.) Is it legal under
> > 2.0? What does it really mean? Shouldn't the '=' operator be forced
> > to only accept an lvalue as its left-hand operand?
> 
> How about making operator+ return a const matrix?
> Then you won't be able to assign to it.
> 
> 
> To tell the truth, I hadn't thought about this issue until this
> question forced me to do so.  There are zillions of things like
> string classes out there that say
> 
> 	extern String operator+(const String&, const String&);
> 
> and apparently they really should say
> 
> 	extern const String operator+(const String&, const String&);

This doesn't seem like the right solution. Consider

String& addeol(String& s) { s += "\n"; return s; }

main()
{
  String a, b; //...
  String c = addeol(a+b);
  //...
}

which would be illegal if operator+ returned a const String. (Yes,
the form of `addeol' is contrived, but not indefensible.)

Actually, I think the `a+b = c;' issue is more of a curiosity -- an
inherent difference between classes and builtins -- than a real
problem. (There are a couple of other class vs builtin differences
along these lines that I briefly mentioned in my Denver Usenix paper.)
The code is legal and compiles (at least with libg++ Strings), but
results in a temporary being created for (a+b), then modified via the
assignment (=c), but never bound to any symbol, so inaccessible.
While this looks odd, it does do exactly what the programmer
specified.
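
A minimal sketch of both behaviours, assuming a toy String class (not
libg++'s) and a compiler that provides <type_traits>. The name
`concat_const' is hypothetical; it stands in for a version of operator+
that returns a const String, as Andy suggests.

```cpp
#include <string>
#include <type_traits>
#include <utility>

// Toy String for illustration only; not any real library's class.
class String {
    std::string rep;
public:
    String(const char* s = "") : rep(s) {}
    const std::string& str() const { return rep; }
    friend String operator+(const String& x, const String& y) {
        String r;
        r.rep = x.rep + y.rep;
        return r;
    }
};

// Hypothetical const-returning variant of concatenation.
const String concat_const(const String& x, const String& y) { return x + y; }

// A non-const class temporary accepts assignment...
static_assert(
    std::is_assignable<
        decltype(std::declval<String&>() + std::declval<String&>()),
        const String&>::value,
    "a+b = c compiles when operator+ returns String");

// ...but a const temporary does not.
static_assert(
    !std::is_assignable<
        decltype(concat_const(std::declval<String&>(), std::declval<String&>())),
        const String&>::value,
    "assignment is rejected when the result is const String");

// The assignment only changes the unnamed temporary; a, b and c keep
// their values.
bool assign_to_sum_leaves_operands_alone() {
    String a("x"), b("y"), c("z");
    a + b = c;
    return a.str() == "x" && b.str() == "y" && c.str() == "z";
}
```

The static_asserts check the two cases at compile time, so nothing here
depends on how long a particular compiler keeps the temporary alive.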

In an unrelated thread, Jerry Schwarz says...

> Indeed, more than discussed.  This is essentially the method
> used by the AT&T 1.2 stream package.  There are several
> problems with it.  Where does the space come from for the string?  
> How about all the twiddles on formatting available in stdio?
> (e.g. the case of the alphabetic "digits" in a hex number)
> 
> But you don't have to choose.  It's fairly easy to implement
> the functionality of the above without intermediate strings.
> 
> One (among several choices) is
> 
> class decimalString {
> public:
> 	decimalString(int v, int w) : value(v), width(w) { }
>     int	value ;
>     int width ;
>     } ;
> 
> ostream& operator<< (ostream& o, const decimalString& s)
> {
>     long f = o.flags();              // save the caller's format flags
>     o << dec << setw(s.width) << s.value ;
>     o.setf(f,ios::basefield);        // restore the caller's base
>     return o ;
>     }
> 
> There is a philosophical point here.  In C the builtin types are
> special.  It's perfectly reasonable to have a C I/O library that
> has a lot of formatting stuff for them.  In C++ user defined classes
> are just as important as the builtin types.  What is important is 
> not that there be a lot of formatting stuff for the builtin types, 
> but that there be a mechanism for extending the I/O.  In C++ it is 
> usually much better to determine styles of printing, widths and
> the like based on the role (type) type of the data rather than
> specifying it at each individual I/O statement.
> 
> In hindsight I think I put too much special stuff in the
> iostream library for the builtin types.  Historically, what
> happened was that the builtin type stuff was done first, and
> only much later did I develop the extensibility features
> (such as xalloc).  
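
For readers who want to try the idea, here is a sketch of the same
class written against the present-day standard <iostream> headers
rather than the AT&T 2.0 library. Saving and restoring the whole flag
word with flags() is a substitution for the setf() call in the quoted
code, and `demo' is a hypothetical helper used only to show the effect.

```cpp
#include <iomanip>
#include <ios>
#include <ostream>
#include <sstream>
#include <string>

// Carries a value plus the width it should be printed in.
class decimalString {
public:
    decimalString(int v, int w) : value(v), width(w) {}
    int value;
    int width;
};

// Prints in decimal at the requested width, then puts the stream's
// format flags back the way the caller had them.
std::ostream& operator<<(std::ostream& o, const decimalString& s) {
    std::ios_base::fmtflags saved = o.flags();
    o << std::dec << std::setw(s.width) << s.value;
    o.flags(saved);
    return o;
}

// Hypothetical helper: the stream is in hex mode, yet the decimalString
// prints in decimal and the following int still prints in hex.
std::string demo() {
    std::ostringstream out;
    out << std::hex << decimalString(255, 6) << ' ' << 255;
    return out.str();   // "   255 ff"
}
```

No intermediate string is built: the formatting request travels with
the object, and the stream does the conversion itself.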


I see the basic problem here just a little differently. The most
primitive stream output routine for printing strings might go
something like:

ostream& ostream::put(const char* p)
{
  while (*p != 0) put(*p++);
  return *this;
}

This can be problematic if you'd like to have ostream << int do
something like

    char* dec(int i);
    ostream& operator << (ostream& s, int i) {  return s.put(dec(i)); }

since, as Jerry notes, you then have to decide how to allocate the
space for the results of dec(). To make dec() reasonably general, you
can't just use a fixed static buffer, or else
    cout << dec(10) << dec(20);
would not work right if, for example, the compiler uses right-to-left
evaluation (which is legal).
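
The underlying hazard can be sketched without any streams at all. The
function name `dec_static' and the buffer size are assumptions made for
the illustration.

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical dec() that formats into one shared static buffer.
const char* dec_static(int i) {
    static char buf[32];
    std::snprintf(buf, sizeof buf, "%d", i);
    return buf;
}

// Both calls return the same address, so the second call overwrites
// the text produced by the first.
bool second_call_clobbers_first() {
    const char* first = dec_static(10);
    const char* second = dec_static(20);
    return first == second && std::strcmp(first, "20") == 0;
}
```

Whenever both results have to be alive at once, whichever call ran
first has already lost its text.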

So instead, you might want to get around this by employing your
off-the-shelf String class:

    class String
    {
      char*  s;
    public:
      
             operator const char* () { return s; }
    //... lots of other stuff
    };

and redo dec() as

    String dec(int i);

but now, something even more unfortunate can happen in

    ostream& operator << (ostream& s, int i) {  return s.put(dec(i)); }

Since dec() returns a String, but put() wants a char*, the String 
operator const char* () conversion is made. However, this too
can fail! The reason has to do with C++ lifetime rules for
temporaries: The temp String returned by dec is `used up' by
the char* conversion, so the compiler is allowed to kill it off
*before* entering put(). But the `conversion' really just returns
a pointer into the String, so if the String is killed off, the pointer
is invalid, and things are broken again. In other words, the char*
conversion operator cannot just return a pointer, it must allocate
some space, and copy the String representation. But where? Back to
square one.
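
One way to sidestep the question inside the stream code itself,
whatever temporary rules a given compiler follows, is to give the
String a name so that it is an ordinary local variable. This is only a
sketch: the String class is a toy, and `dec_str', `put_chars',
`print_int' and `format_int' are hypothetical names written against the
standard <ostream> header.

```cpp
#include <cstdio>
#include <ostream>
#include <sstream>
#include <string>

// Toy stand-in for an off-the-shelf String class.
class String {
    std::string rep;
public:
    explicit String(const char* s) : rep(s) {}
    // Returns a pointer into this String; valid only while it lives.
    operator const char*() const { return rep.c_str(); }
};

// Hypothetical dec() that returns a String by value.
String dec_str(int i) {
    char buf[32];
    std::snprintf(buf, sizeof buf, "%d", i);
    return String(buf);
}

// Stand-in for the primitive put(const char*) routine.
std::ostream& put_chars(std::ostream& o, const char* p) {
    while (*p != 0) o.put(*p++);
    return o;
}

std::ostream& print_int(std::ostream& o, int i) {
    String named = dec_str(i);      // a named local lives to the closing brace
    return put_chars(o, named);     // the pointer refers into 'named'
}

// Helper for trying it out.
std::string format_int(int i) {
    std::ostringstream out;
    print_int(out, i);
    return out.str();
}
```

This does not help callers who write the conversion themselves in a
larger expression, which is why the more general solutions are still
of interest.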

Here are some solutions:

1) Make an ostream << String operator, and use it exclusively instead
of char*'s, from the ground up, in ostreams. This is the right
solution in many senses, but is problematic in that it presupposes
that there is a single, best String class out there suitable for all
needs. But there are many good String classes around. Standardizing on
a particular version to serve as the basis for the de facto standard
stream library seems premature.

2) Change the C++ rules about lifetimes for temporaries, so that they,
like `normal' variables, have lifetimes to the end of the enclosing
scope. This solution has merit on other grounds as well, but also
creates some of its own difficulties. Actually, this may be going too
far.  The lifetime rules for temporaries say that if a *reference* to
a temp (or any part thereof?) is taken (or any ref-returning member
function is called?), then its lifetime *is* to the end of the enclosing
scope. The char* conversion *behaves* like a reference, but is not
one. I once proposed that C++ allow the idiom of a char[]& to mean a
reference to a character array. Support of this would solve this (and
other) problems, since one could create a
    char[]& String::chars() { return s /* or whatever */ ; }, 
call it inside the ostream << int via `return put(dec(i).chars())',
and everything would work just right.  But no one has ever told me
that they particularly like this idea.

3) Have dec() and friends return freestore allocated space, and
require that programmers manually delete them. Most users wouldn't
like this very much.

4) Use a garbage collection scheme for formatting strings, and/or
Strings in general. This seems to be overkill for the problem
at hand. Strings themselves are very well behaved lifetime-wise; it's
the char* conversions that raise problems.

5) Create a simple approximation to garbage collection. Set up a pool
of space to be used for miscellaneous conversions, and use it for
dec(), oct(), and so on. Guarantee that the most recent N (some FIXED
number, say, 100) formatting strings will be on hand at any given
time. The pool manager can then reuse the space for old formatting
strings when needed.  Both AT&T 1.2 and libg++-1.36.0 use some
variation of this approach. The String operator const char*() may also
copy into this pool. The major drawback is that if programmers
contrive expressions that require more than N live formatting
strings, then they are out of luck.

6) Avoid reliance on generic conversion functions like dec(), and
build special conversion buffers, etc., into the stream classes. AT&T
2.0 streams appear to do something along these lines. As Jerry says,
this puts too much smarts in the stream classes, but is entirely safe.
Unfortunately, it is also not as easily extensible as one might like.
It is awkward (although not impossible) to use this scheme to output,
say, arbitrary-precision Integers or other types in which the user
class, not the stream class, knows how to set things up for formatting.
It also limits generality a bit. Formatting strings are sometimes
needed for other purposes than ostream output.
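
The pool in solution 5 can be sketched as a fixed ring of buffers. This
is not the AT&T or libg++ code; the names `next_slot' and `dec_pooled'
and the sizes are assumptions, with N shrunk to 4 so the recycling is
easy to see.

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical sizes; the text above suggests an N of about 100.
const int kSlots = 4;
const int kSlotSize = 32;

// Hands out the next buffer in a fixed ring, reusing the oldest one.
char* next_slot() {
    static char pool[kSlots][kSlotSize];
    static int next = 0;
    char* slot = pool[next];
    next = (next + 1) % kSlots;
    return slot;
}

// Formats into a pooled buffer; the result stays valid until kSlots
// further conversions have been made.
const char* dec_pooled(int i) {
    char* slot = next_slot();
    std::snprintf(slot, kSlotSize, "%d", i);
    return slot;
}

// Two results can be alive at once, unlike the single static buffer.
bool recent_results_stay_live() {
    const char* a = dec_pooled(10);
    const char* b = dec_pooled(20);
    return std::strcmp(a, "10") == 0 && std::strcmp(b, "20") == 0;
}

// After kSlots more conversions the first buffer has been recycled.
bool old_result_is_recycled() {
    const char* first = dec_pooled(1);
    for (int k = 0; k < kSlots; ++k) dec_pooled(100 + k);
    return std::strcmp(first, "1") != 0;
}
```

The second function shows the drawback noted in solution 5: a pointer
held across more than N conversions silently refers to someone else's
text.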

--
Doug Lea, Computer Science Dept., SUNY Oswego, Oswego, NY, 13126 (315)341-2367
email: dl@oswego.edu              or dl%oswego.edu@nisc.nyser.net
UUCP :...cornell!devvax!oswego!dl or ...rutgers!sunybcs!oswego!dl

ark@alice.UUCP (Andrew Koenig) (10/25/89)

In article <DL.89Oct24075027@g.g.oswego.edu>, dl@g.g.oswego.edu (Doug Lea) writes:

> This doesn't seem like the right solution. Consider

> String& addeol(String& s) { s += "\n"; return s; }

> main()
> {
>   String a, b; //...
>   String c = addeol(a+b);
>   //...
> }

> which would be illegal if operator+ returned a const String. (Yes,
> the form of `addeol' is contrived, but not indefensible.)

I suggest that operator+(const String&, const String&) should
return a const String precisely so that stuff like the example
above will be illegal.

The trouble with the example is that the value of a+b is a temporary
that can be destroyed as soon as addeol() returns.  Thus it seems
to me that it should be OK for a compiler to generate code that
looks like this:

	evaluate a+b into a temporary T
	call addeol(T) and save a reference to the result
	destroy T
	copy the saved result of addeol() into c

In this case, the `saved result' of addeol will have been destroyed
before copying it, so c will be garbage.

You might say that this argues that the destruction of the temporary
that holds a+b should be deferred until later.  Unfortunately, doing
that doesn't eliminate the problem, it just makes it less likely.
-- 
				--Andrew Koenig
				  ark@europa.att.com

jss@jra.ardent.com (Jerry Schwarz (Compiler)) (10/25/89)

In article <DL.89Oct24075027@g.g.oswego.edu> dl@oswego.edu writes:
>
>6) Avoid reliance on generic conversion functions like dec(), and
>build special conversion buffers, etc., into the stream classes. AT&T
>2.0 streams appear to do something along these lines. As Jerry says,
>this puts too much smarts in the stream classes, but is entirely safe.

My remark was subject to misinterpretation.  I'll try to clarify.  

The 2.0 iostream classes contain mechanisms (xalloc, bitalloc, iword,
and pword) to support formatting state for user defined
classes.  If I were redoing the package I would be inclined
to use that general mechanism to deal with the builtins
as well.  This would eliminate all the special stuff for them.

I'm not sure what "special conversion buffers" are.  I don't
think the iostream library has anything that is reasonably described
with that phrase. 

>Unfortunately, it is also not as easily extensible as one might like.

I'm not sure whether this refers to functionality or the amount
of effort required to write the extension.   It does require
more coding than I would like to do some kinds of extensions,
but I've achieved a reasonable functionality in all cases I've 
encountered.

Jerry Schwarz

dl@g.g.oswego.edu (Doug Lea) (10/25/89)

I had written...

> > String& addeol(String& s) { s += "\n"; return s; }
> 
> > main()
> > {
> >   String a, b; //...
> >   String c = addeol(a+b);
> >   //...
> > }
> 
> > which would be illegal if operator+ returned a const String. (Yes,
> > the form of `addeol' is contrived, but not indefensible.)
> 

Andy replied...

> The trouble with the example is that the value of a+b is a temporary
> that can be destroyed as soon as addeol() returns.  Thus it seems
> to me that it should be OK for a compiler to generate code that
> looks like this:
> 
> 	evaluate a+b into a temporary T
> 	call addeol(T) and save a reference to the result
> 	destroy T
> 	copy the saved result of addeol() into c
> 
> In this case, the `saved result' of addeol will have been destroyed
> before copying it, so c will be garbage.
> 
> You might say that this argues that the destruction of the temporary
> that holds a+b should be deferred until later.  Unfortunately, doing
> that doesn't eliminate the problem, it just makes it less likely.
> 

It's hard to be sure. In my (draft) copy of the 2.0 Reference Manual,
section 12.2, it says

    The compiler must ensure that a temporary object is destroyed.
    There are only two things that can be done with a temporary:
    fetch its value (implicitly copying it) to use in some other
    expression, or bind a reference to it. If the value of a
    temporary is fetched, that temporary is dead and can be destroyed
    immediately. If a reference is bound to the temporary, the
    temporary must not be destroyed until the reference is. This
    destruction must take place before exit from the scope in which
    the temporary is created.

This statement does not explicitly address what happens with multiple
references: addeol makes a ref of the temp holding a+b, and in turn
binds another ref to it (the return value). The `right' thing to
do is to not kill the temp until the return val reference is destroyed
(after construction into c). Of course, the compiler cannot know this if
addeol is not inline or is an extern, but I assumed that the above rule
requires that a compiler play it safe, and not kill the temp until
all references spawned from expressions involving it die. But I see that
your interpretation could also be right.

As I too weakly implied elsewhere in my last note, I think the temp
rules could be strengthened by generalizing this paragraph via the
simple statement that a temporary (like any normal variable) may be
destroyed only when a compiler can prove that it is no longer useful,
or at the end of the enclosing scope, whichever comes first. This rule
is similar to those used in other languages. The rule requires that if
a temp is involved in any reference-returning member or top-level
function, or a reference is bound to any of its parts, it should live.
It might also be allowed to live if the compiler can prove that it
will be recomputed/reused later in the same block (as may be
discovered via available expression analysis), thus killing off the
recomputation. On the other hand, it might be killed immediately if it
is never used (e.g., the single statement `a+b;'), a possibility
ignored in the above. This restatement would thus cover the current
cases, and also allow the possibility of a smarter compiler doing
smarter things. (I guess I should add that, as we've gone over before,
a rule like this also helps legitimize and extend the current practice
of not generating X(X&)-based temporaries at all in some situations.)

I should emphasize that the form of addeol is NOT one I recommend.

Oh, I should clarify another remark in my last note: OPERAND evaluation
order is undefined in C++ for `cout << dec(10) << dec(20)'. OPERATOR
evaluation is, of course, left-to-right.
