[comp.lang.c] modification of strings

chad@lakesys.UUCP (D. Chadwick Gibbons) (02/04/89)

        I believe my understanding is mistaken when it comes to the
interpretation of string constants and the effect the standard library
functions can have on them.  Insofar as I have been told, strings cannot
be modified - note, the type char *blah = "this is a string"; not the everyday
normal strings we use.  It would appear that if you attempted to modify
their contents, you would either get a core dump of some flavor, or
the program would ignore your request.

        In general, the function of strcat is defined as

                char *strcat(s, ct)
                char *s, *ct;

Where s is the original string you wish to add to, and ct is the string you
wish to append - as I'm sure you really didn't know that :)

        With that definition, consider the following artificial sequence

                char *blah = "meow";
                char *tmp;

                tmp = strcpy(blah, "grr, snarl, hiss");

I would think since the string 'blah' is considered to be nonmodifiable that
it would not be changed, but the result would be placed into tmp.  However, on
different systems, this provides different results:
        SCO XENIX/286 2.2.2     core dumps on next access of any type to 'blah'
        SCO XENIX/386 2.3.2     gives various warning messages but treats
                                'blah' like a normal string
        BSD 4.2                 does random things
        AT&T System V r3        refuses to work on Thursdays, but acts like
                                XENIX/386 on others

Apparently, either the effect of strings is not yet defined in these
implementations, or, more likely, what I was taught is incorrect.

Enlightenment is welcomed.
-- 
D. Chadwick Gibbons, chad@lakesys.lakesys.com, ...!uunet!marque!lakesys!chad

gwyn@smoke.BRL.MIL (Doug Gwyn ) (02/04/89)

In article <345@lakesys.UUCP> chad@lakesys.UUCP (D. Chadwick Gibbons) writes:
>Insofar as I have been told, strings can not be modified ...

That depends on the implementation.  Some permit it.  However,
you cannot portably count on being able to modify a string literal.

>                char *blah = "meow";
>                char *tmp;
>                tmp = strcpy(blah, "grr, snarl, hiss");
>I would think since the string 'blah' is considered to be nonmodifiable that
>it would not be changed, but the result would be placed into tmp.

No, check the definition of strcpy().  You're attempting to modify a
string literal.  strcpy() is not obliged to second-guess your intentions
and somehow save your ass.  In fact in most implementations it isn't
able to efficiently ascertain that you're misusing it until it makes
the actual write attempt, at which point it's already too late.
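
As a minimal sketch of what the definition says (the helper name
strcpy_returns_dst is hypothetical, not part of any library): strcpy()
writes into the destination it is given and returns that same pointer.
It never allocates a fresh result, so the destination must already be
writable and large enough.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: copies src into a caller-supplied writable buffer
 * of dst_size bytes and reports whether strcpy() handed back dst itself. */
int strcpy_returns_dst(char *dst, size_t dst_size, const char *src)
{
    char *ret;

    assert(dst != NULL && src != NULL);
    assert(strlen(src) < dst_size);   /* room for the text plus its '\0' */

    ret = strcpy(dst, src);           /* writes into dst; allocates nothing */
    return ret == dst;                /* the return value is just dst again */
}
```

Called with a real array such as char buf[32], this is well defined;
called with a pointer to a string literal as dst, it is not.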

gandalf@csli.STANFORD.EDU (Juergen Wagner) (02/05/89)

In article <345@lakesys.UUCP> chad@lakesys.UUCP (D. Chadwick Gibbons) writes:
>...
>				Insofar as I have been told, strings can not
>be modified - note, the type char *blah = "this is a string"; not the everyday
>normal strings we use.  This would appear that if you attempted to modify
>their contents, you would either get a core dump of some various flavor, or
>the program would ignore your request.

Actually, the effect you get depends on the system you're trying this on. If
your machine puts the string into text space, together with your code (HP-UX
does that), you lose when you try to change the string. Other systems usually
have a loader option to specify this behavior.

For the sake of portability, assume constant strings are read-only.

>...[TFM quote deleted]
>
>                char *blah = "meow";
>                char *tmp;
>
>                tmp = strcpy(blah, "grr, snarl, hiss");

Ok. Now let's see what this piece of code does. 'blah' is a pointer to char.
It is initialized to point at the first character of the char vector
	< 'm', 'e', 'o', 'w', '\0' >
which occupies five bytes. 'tmp' is just another pointer to char but
uninitialized.

The strcpy statement copies a string of length (humph!) 16 + 1 (for the 
zero byte) into consecutive byte locations from the point 'blah' points
to on.

Hmmm.... your compiler allocated five bytes for the string but you are now
using 17 for the new string. Strcpy will just overwrite whatever follows the
string. If that happens to be another statically allocated string, it will
show changed contents. If that happens to be some data space, variables seem
to change values. If that happens to be just beyond the allocated memory page,
you get some kind of error (segmentation fault et al.). If your compiler
happily put the string in the midst of code, you will either overwrite code
or get some error like segmentation fault (text space is Read-Only). If the
string was allocated on the stack, your return address might be f***ed up.
If....

As you can see, there is a vast number of alternatives, and you tried some of
them.

As a rule of thumb, I suggest checking calls to destructive functions like
strcat, strcpy, et al. very carefully. Sometimes they cause errors by
overwriting pieces of memory used in completely different portions of your
program, and the stuff becomes hard to debug. Allocate all the memory you
need, and don't try to overwrite static strings.
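
One possible way to follow that advice, sketched under the assumption
that heap allocation is acceptable (the name writable_copy is
hypothetical): ask for a buffer at least as large as the longest string
you intend to store, and copy the literal into it before modifying
anything.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap buffer of at least min_size bytes
 * (and never less than strlen(text) + 1) holding a copy of text.
 * Returns NULL if malloc() fails; the caller must free() the result. */
char *writable_copy(const char *text, size_t min_size)
{
    size_t need;
    char *buf;

    assert(text != NULL);
    need = strlen(text) + 1;          /* the text plus its '\0' */
    if (need < min_size)
        need = min_size;              /* leave room for later, longer strings */

    buf = malloc(need);
    if (buf != NULL)
        strcpy(buf, text);            /* safe: buf holds at least strlen(text) + 1 */
    return buf;
}
```

With blah = writable_copy("meow", 32), a later
strcpy(blah, "grr, snarl, hiss") needs 17 bytes and fits in the 32
that were requested.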

-- 
Juergen Wagner		   			gandalf@csli.stanford.edu
						 wagner@arisia.xerox.com

jeenglis@nunki.usc.edu (Joe English) (02/06/89)

chad@lakesys.UUCP writes:
>                char *blah = "meow";
>                char *tmp;
>
>                tmp = strcpy(blah, "grr, snarl, hiss");
>
>I would think since the string 'blah' is considered to be nonmodifiable that
>it would not be changed, but the result would be placed into tmp.  
>[...]
>Apparently, either the effect of strings is not yet defined in these
>implementations, or, more likely, what I was taught is incorrect.

What you were taught is incorrect.

The type "char *" means, "pointer to char."  A char *
can point to either a single character or an array
of characters (or NULL or a garbage value.)  Since
strings are stored as arrays of characters, "char *"
is the type used to reference them; but you still
get pointer semantics, not string semantics as in
other languages.

The str... functions give some string manipulation
functionality, but you still have to allocate space
for the strings themselves.  For example, strcat(char
*s1,char *s2) places a copy of the string pointed to
by s2 immediately after the string pointed to by s1,
where the end of each string is determined by a '\0'
character value.  If s1 doesn't point to an area of
memory large enough to hold both strings, you have
problems.

Another note: the return value of strcat, strcpy,
etc., is for the most part useless.  strcat(s1,s2)
returns s1 (which the caller presumably already
knows); it does *not* make a new string.

So in your example above, blah points to an array
5 characters long which is initialized to 
{'m','e','o','w','\0' }.  Since the array is only 5
characters long, any attempt to write data past its
end (as the call to strcpy() does) is going to
cause undefined, usually harmful behaviour.
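
If what you actually want is a new string, you have to build it
yourself.  A rough sketch (concat_new is a hypothetical name, not a
standard function) that allocates room for both strings and then uses
strcpy and strcat on the fresh buffer:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated string holding s1
 * followed by s2, or NULL if malloc() fails.  Neither argument is
 * modified; the caller must free() the result. */
char *concat_new(const char *s1, const char *s2)
{
    size_t len1, len2;
    char *result;

    assert(s1 != NULL && s2 != NULL);
    len1 = strlen(s1);
    len2 = strlen(s2);

    result = malloc(len1 + len2 + 1);   /* both strings plus one '\0' */
    if (result == NULL)
        return NULL;

    strcpy(result, s1);                 /* result now holds a copy of s1 */
    strcat(result, s2);                 /* appends s2 in place; returns result */
    return result;
}
```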


Hope this helps,

--Joe English

  jeenglis@nunki.usc.edu

norm@oglvee.UUCP (Norman Joseph) (02/06/89)

From article <7429@csli.STANFORD.EDU>, by gandalf@csli.STANFORD.EDU (Juergen Wagner):
> In article <345@lakesys.UUCP> chad@lakesys.UUCP (D. Chadwick Gibbons) writes:
>>
>>                char *blah = "meow";
>>                char *tmp;
>>
>>                tmp = strcpy(blah, "grr, snarl, hiss");
> 
> and the stuff becomes hard to debug. Allocate all the memory you need, and
> don't try to overwrite static strings.

I can see that the above strcpy() will overwrite something somewhere
since strlen( "meow" ) < strlen( "grr, snarl, hiss" ).  But what if
the code looked like this (ignoring `tmp' for this example):

                char *blah = "meow\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0";

                strcpy( blah, "grr, snarl, hiss" );

assuming that you could write to the space into which `blah' pointed?
-- 
Norm Joseph - Oglevee Computer System, Inc.
UUCP: ...!{pitt,cgh}!amanue!oglvee!norm
"Mate, that parrot wouldn't *VROOM* if you put four million volts through it!"

karl@haddock.ima.isc.com (Karl Heuer) (02/09/89)

In article <466@oglvee.UUCP> norm@oglvee.UUCP (Norman Joseph) writes:
>[The previous example overflows,] but what if the code looked like this:
>                char *blah = "meow\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0";
>                strcpy( blah, "grr, snarl, hiss" );
>assuming that you could write to the space into which `blah' pointed?

You'd better also assume that string literals are not shared.  Even so, you
may be in for a surprise when you execute this code fragment the second time,
and find that blah[0]=='g' immediately after the initialization to
(apparently) "meow".

This kludge is confusing and unportable.  Don't use it.
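
For contrast, a small sketch of the well-defined alternative (the
function name is hypothetical): an automatic array initialized from the
literal gets its contents copied in afresh on every call, so it really
does start out as "meow" each time, however it was overwritten before.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical demo: notes the first character right after
 * initialization, then overwrites the array. */
char first_char_then_overwrite(void)
{
    char blah[32] = "meow";   /* array, not a pointer: refilled on every call */
    char first = blah[0];

    assert(sizeof blah > strlen("grr, snarl, hiss"));
    strcpy(blah, "grr, snarl, hiss");   /* modifies the array, not a literal */
    return first;                       /* 'm' on every call */
}
```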

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

henry@utzoo.uucp (Henry Spencer) (02/09/89)

In article <345@lakesys.UUCP> chad@lakesys.UUCP (D. Chadwick Gibbons) writes:
>... Insofar as I have been told, strings can not
>be modified - note, the type char *blah = "this is a string"; not the everyday
>normal strings we use.  This would appear that if you attempted to modify
>their contents, you would either get a core dump of some various flavor, or
>the program would ignore your request...

Not quite; the situation is that either of those things, or something much
more bizarre, can happen.  Note, "can", not "will".  Civilized/portable
programs should never attempt to modify a string literal.  The effects of
trying to modify one are entirely unpredictable.
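
One way to get help from the compiler, assuming it supports the ANSI
"const" qualifier: declare pointers to literals as const char *, so that
an assignment such as blah[0] = 'g' draws a diagnostic, and do any
editing on a copy.  The helper below (upper_copy, a hypothetical name)
is only a sketch of that pattern.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: writes an upper-case copy of src into dst and
 * returns dst.  src is only read, so it may point at a string literal. */
char *upper_copy(char *dst, size_t dst_size, const char *src)
{
    size_t i, len;

    assert(dst != NULL && src != NULL);
    len = strlen(src);
    assert(len < dst_size);           /* room for the text plus its '\0' */

    for (i = 0; i < len; i++)
        dst[i] = (char)toupper((unsigned char)src[i]);
    dst[len] = '\0';
    return dst;
}
```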
-- 
Allegedly heard aboard Mir: "A |     Henry Spencer at U of Toronto Zoology
toast to comrade Van Allen!!"  | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

evil@arcturus.UUCP (Wade Guthrie) (02/11/89)

In article <11711@haddock.ima.isc.com>, karl@haddock.ima.isc.com (Karl Heuer) writes:
> In article <466@oglvee.UUCP> norm@oglvee.UUCP (Norman Joseph) writes:
> >[The previous example overflows,] but what if the code looked like this:
> >                char *blah = "meow\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0";
> >                strcpy( blah, "grr, snarl, hiss" );
[. . .] 
> This kludge is confusing and unportable.  Don't use it.

Just thought I might leap into the fray:  One aspect of the portability
issue is, unless I misremember, that the compiler may place string
literals in protected memory (if that exists on your system), causing
an exception and subsequent *BOMB* upon attempted modification.


Wade Guthrie
evil@arcturus.UUCP
Rockwell International
Anaheim, CA

(Rockwell doesn't necessarily believe / stand by what I'm saying; how could
they when *I* don't even know what I'm talking about???)

lupton@uhccux.uhcc.hawaii.edu (Robert Lupton) (02/16/89)

Rumour has it that sscanf modifies strings passed as a first argument
on at least some machines (e.g. some suns?). Well, it doesn't actually
modify the contents, but the compiler doesn't know that. Does anyone
have any information?


			Robert

guy@auspex.UUCP (Guy Harris) (02/18/89)

>Rumour has it that sscanf modifies strings passed as a first argument
>on at least some machines (e.g. some suns?).

"Some" Suns?  Yeesh, "_doscan" isn't one of the machine-dependent
modules; the same source is used on *all* Suns.

In fact, the same source is used on a bunch of non-Sun machines as well;
the SunOS 3.2-3.5 version is based on the S5R2 version, the SunOS 4.0
version is based on the S5R3 version, and the version in SunOS releases
prior to 3.2 is based on the 4.2BSD version, which is probably based on
the V7 version.  The bug exists in S5 releases from AT&T, as well as
4.xBSD.

The problem is that "*scanf" - or, to be precise, "_doscan" and the
routines it calls, which are the "guts" of the "scanf" routines in many
implementations - uses "ungetc".  All very well and good when you're
doing I/O to a file; "ungetc" stuffs the ungotten character back into
the I/O buffer.  However, the way "sprintf" and "sscanf" work in many
(most?) UNIX C implementations is that they turn the string in question
into a "funny" I/O buffer; most "ungetc" implementations don't
understand this, and try to stuff the character back into the "buffer"
anyway, which means they try to modify the string.

>Well, it doesn't actually modify the contents,

Which, in this particular case, is, I think, true; the character being
stuffed back is a character that's just been "read" from the string.

>but the compiler doesn't know that.

It's not the compiler that has to know that; it's "ungetc".  In
"comp.bugs.4bsd" this very "sscanf" bug is being discussed; one
suggested fix is to have "ungetc" check whether the character it's
stuffing back into the buffer is the one that is in the buffer and, if
so, just back up the buffer pointer and count.
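
A toy sketch of that suggested fix follows.  The struct and the sr_*
names are hypothetical and are not the real stdio internals; the point
is only that a push-back of the character just read can be done by
moving the pointer, with no store into the string.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical read-only string "stream"; not the real FILE layout. */
struct str_reader {
    const char *base;   /* start of the string being scanned */
    const char *next;   /* next character to be read */
};

/* Returns the next character as an unsigned char value, or -1 at the end. */
int sr_getc(struct str_reader *r)
{
    assert(r != NULL);
    if (*r->next == '\0')
        return -1;
    return (unsigned char)*r->next++;
}

/* Pushes c back only if it is the character just read, by backing up
 * the pointer.  Never writes to the string.  Returns c, or -1 if the
 * push-back would require a write or nothing has been read yet. */
int sr_ungetc(struct str_reader *r, int c)
{
    assert(r != NULL);
    if (r->next == r->base)
        return -1;                        /* nothing to back up over */
    if ((unsigned char)r->next[-1] != c)
        return -1;                        /* different char: would need a store */
    r->next--;
    return c;
}
```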

decot@hpisod2.HP.COM (Dave Decot) (02/21/89)

Note, however, that:

	static char blah[20] = "meow";
	char *tmp;

	tmp = strcpy(blah, "grr, snarl, hiss");

works nicely, because enough space is allocated to hold the longer value,
and the space is guaranteed to be writable.

Dave