[comp.lang.c] Why are character arrays special

thoth@beach.cis.ufl.edu (Robert Forsman) (02/06/89)

 I've got a question about what is in the standard concerning
initializing pointers to objects other than chars.  I have a
program that works off of strings of ints (I ran out of char
values and switched up).  Anyway, I occasionally need to do
stuff analogous to

	if (strcmp("hello",s)==0)

but it would probably look like this

	if (intcmp((int*){43,21,5,0},command)==0)

 Now you don't really want to have to waste source code space
to declare int strings, you want to declare them on the fly
just like "hello".  Unfortunately, I can't figure out how to
do this so I have resorted to constructs like this :

	static int *GET_TORCH = { 34,52,0};
	if (intcmp(GET_TORCH,command)==0)

  The problem is that even this gives errors in GCC.
<<initializer for scalar variable requires one element.>>
  It IS one element, a pointer to several elements is one
element!?  When you say char *mesg="enter sum>"; the
compiler doesn't complain.  It figures out that enter sum is
a character string that is supposed to reside in the data
space and assigns mesg the pointer to it.  I finally had to
declare it this way :
	static int GET_TORCH[] = { 34,52,0};
 This is probably the second best way of doing it, the stack
isn't cluttered by pointers that have to be initialized and
the value is static, but the best way would be the first
(int *){ 34,52,0 }, no names to clutter the compiler and I
wouldn't have wasted a line for a declaration.
  The question is, what does the standard say about
structure and array constants (and arrays of structure
constants, etc.)?  I think that to not have the capability
to declare something like that on the fly is crippled.  If
it isn't in the standard, somebody get it in there.  If it
IS in the standard, somebody get GNU to add it to their
compiler (or tell me how it's done).
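
A minimal sketch of the named-array workaround, for anyone who wants
something to compile.  intcmp is not a standard function; the version
below is an assumed strcmp analogue for zero-terminated int strings,
and is_get_torch is a made-up wrapper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical strcmp analogue for zero-terminated "strings" of ints. */
int intcmp(const int *a, const int *b)
{
    assert(a != NULL && b != NULL);
    while (*a != 0 && *a == *b) {
        a++;
        b++;
    }
    /* Compare rather than subtract, so large values cannot overflow. */
    return (*a > *b) - (*a < *b);
}

/* A named static array is the portable way to spell an int "string". */
static const int GET_TORCH[] = { 34, 52, 0 };

/* Returns 1 when command matches GET_TORCH, 0 otherwise. */
int is_get_torch(const int *command)
{
    return intcmp(GET_TORCH, command) == 0;
}
```

The array costs one declaration line, but it is initialised once and
no pointer has to be set up at run time.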


-----------------------------------------------------------
If you're thinking of suing my employer over my opinions
then I have no other recourse than to tell you that you are
lower than George Bush and Michael Dukakis put together.

chris@mimsy.UUCP (Chris Torek) (02/07/89)

In article <19742@uflorida.cis.ufl.EDU> thoth@beach.cis.ufl.edu
(Robert Forsman) writes:
>I've got a question about what is in the standard concerning
>initializing pointers to objects other than chars.

It seems to be time for a rerun.

There are no anonymous aggregate objects other than arrays of
chars, just as in classic C.  Aggregate initialisation values,
however, have been made legal, so some of the below does not
apply.

From: chris@umcp-cs.UUCP (Chris Torek)
Newsgroups: net.lang.c
Subject: Re: C Coding Question
Message-ID: <2973@umcp-cs.UUCP>
Date: 16 Aug 86 00:58:44 GMT
Date-Received: 16 Aug 86 00:58:44 GMT
References: <248@killer.UUCP> <138@darth.UUCP>
Reply-To: chris@umcp-cs.UUCP (Chris Torek)

In article <138@darth.UUCP> gary@darth.UUCP (Gary Wisniewski) writes:
>As far as your question about "char *help[]" and "char **help": the two
>forms are IDENTICAL to virtually every C compiler (that's worth its
>salt).  Arrays in C are merely special cases of pointers.  In other
>words, both forms are correct.

	NO!

Ai!  This has been asserted far too often.  Arrays and pointers are
not at all the same thing in C!

>Section 5.3 of K&R explain this more fully.

Indeed it does, and I suggest you read it rather more carefully.

  The correspondence between indexing and pointer arithmetic is
  evidently very close. ... The effect is that an array name *is*
  a pointer expression.  (p. 94)

This does not say that arrays and pointers are *the same*.

  There is one difference between an array name and a pointer
  that must be kept in mind.

Aha!  See p. 94 for that difference.

  As formal parameters in a function definition,

	char s[];

  and

	char *s;

  are exactly equivalent.... (p. 95)

Here they *are* the same---but note the qualifier: `As formal
parameters'.  In the (unquoted) original example, the array was a
global variable.
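
The parameter case can be checked directly.  In this sketch
(count_chars is an invented name) the prototype uses a pointer and
the definition uses array syntax; the two are compatible, and the
parameter can be incremented, which a real array could not be.

```c
#include <assert.h>
#include <stddef.h>

size_t count_chars(char *s);    /* declared with a pointer ... */

size_t count_chars(char s[])    /* ... defined with array syntax: same type */
{
    size_t n = 0;

    assert(s != NULL);
    while (*s++ != '\0')        /* s is a pointer here, so s++ is allowed */
        n++;
    return n;
}
```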

There is one other thing which, I guess, adds to this confusion.
Both of the following are legal global declarations in C:

	char	msg0[] = "Hello, world";
	char	*msg1 = "Hello, world";

Given both declarations,

	printf("%s\n", msg0);

and

	printf("%s\n", msg1);

produce the same output.  Yet msg0 and msg1 are not the same:

	printf("%d %d\n", sizeof (msg0), sizeof (msg1));

prints

	13 4

on a Vax; for msg0 is an array, and msg1 is a pointer.  The code
generated for the two declarations is different:

	/* edited assembly output from ccom */
		.data			# Switch to data segment.
		.globl	_msg0		# The array ...
	_msg0:	.asciz	"Hello, world"	# and here it is.

		.data	2		# Switch to alternate data segment.
	L12:	.asciz	"Hello, world"	# The object to which msg1 will point.
		.data			# Back to regular data segment.
		.globl	_msg1		# The pointer ...
	_msg1:	.long	L12		# which points to the object.
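
The same distinction can be checked without reading assembly.  A
minimal sketch: the helper names are invented for illustration, and
the pointer size depends on the machine (4 on the Vax above, often 8
on current hardware).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

char msg0[] = "Hello, world";   /* an array of 13 chars */
char *msg1 = "Hello, world";    /* a pointer to an anonymous array */

/* Size of the array object itself. */
size_t size_of_array(void)
{
    /* The array holds the 12 characters plus the terminating '\0'. */
    assert(sizeof msg0 == strlen(msg0) + 1);
    return sizeof msg0;
}

/* Size of the pointer, whatever it happens to point to. */
size_t size_of_pointer(void)
{
    return sizeof msg1;
}
```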

String constants comprise two special cases in the compiler.  The
first case is when the constant appears anywhere *except* as an
initialiser for a `char' array.  Here the compiler uses the alternate
data segment to suddenly `create' a new array, initialised to the
string text; it then generates a pointer to that array.  In the
second case the string constant is generated in the primary data
segment, and `is' the array being initialised: the constant is
`unwrapped' into an aggregate initialisation.

The second case is actually the more `conventional' of the two;
other aggregates cannot be created at run time:

	int a[] = { 0, 1, 2, 3 };

is legal only outside functions.  What seems surprising to some is
that the same is true of

	char s[] = "foo";

because, unwrapped, this is equivalent to

	char s[] = { 'f', 'o', 'o', '\0' };
	
---even though

	char *s = "foo";

is legal anywhere a declaration is legal.

Ah, if only C had aggregate initialisers!
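
As the preface to this rerun notes, ANSI C has since made aggregate
initialisers legal for automatic variables, so the restriction
described above should not apply to a conforming compiler.  A minimal
sketch, with function names invented for illustration:

```c
#include <assert.h>
#include <string.h>

/* Sum an automatic int array initialised inside the function. */
int sum_local_array(void)
{
    int a[] = { 0, 1, 2, 3 };   /* automatic aggregate initialiser */
    int i, total = 0;

    for (i = 0; i < (int)(sizeof a / sizeof a[0]); i++)
        total += a[i];
    return total;
}

/* A local char array is a fresh, writable copy of the string. */
int local_string_is_writable(void)
{
    char s[] = "foo";           /* same as { 'f', 'o', 'o', '\0' } */

    assert(sizeof s == 4);
    s[0] = 'g';                 /* fine: s is our own array */
    return strcmp(s, "goo") == 0;
}
```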
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 1516)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@mimsy.umd.edu
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

walker@ficc.uu.net (Walker Mangum) (02/08/89)

In article <15833@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
> In article <19742@uflorida.cis.ufl.EDU> thoth@beach.cis.ufl.edu
> From: chris@umcp-cs.UUCP (Chris Torek)
> 
> In article <138@darth.UUCP> gary@darth.UUCP (Gary Wisniewski) writes:
> >As far as your question about "char *help[]" and "char **help": the two
> >forms are IDENTICAL to virtually every C compiler (that's worth its
> >salt).  Arrays in C are merely special cases of pointers.  In other
> >words, both forms are correct.
> 

[ K&R quotes deleted ]

> 	char	msg0[] = "Hello, world";
> 	char	*msg1 = "Hello, world";
> 
> Given both declarations,
> 
> 	printf("%s\n", msg0);
> 
> and
> 
> 	printf("%s\n", msg1);
> 
> produce the same output.  Yet msg0 and msg1 are not the same:
> 
> 	printf("%d %d\n", sizeof (msg0), sizeof (msg1));
> 

An important difference is, for any C compiler "that's worth its salt",
msg0 may *not* be used as an lvalue!

Try this on your compiler:

char	msg0[] = "Hello, world";
char	*msg1 = "Hello, world";

main(argc,argv)
int argc;
char *argv[];

{
    msg1 = msg0;   /* this is ok - msg1 is a pointer, a legal "lvalue"       */
    msg0 = msg1;   /* this better fail if your compiler is "worth its salt"! */
                   /* msg0 is the address of an array, and may not be        */
                   /* reassigned.  In K&R terms, it may be used only as an   */
                   /* "rvalue", not an "lvalue"                              */
}

I get the following:
x.c(10) : error 106: `=' : left operand must be lvalue
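
In ANSI terminology msg0 is an lvalue, but not a modifiable one,
which is why the assignment is rejected.  To get the effect of
"msg0 = msg1" one copies the characters instead.  A minimal sketch;
copy_into is a made-up helper, not a library function:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src into dst (capacity cap), always
   terminating.  Returns the number of characters copied, not
   counting the '\0'. */
size_t copy_into(char *dst, size_t cap, const char *src)
{
    size_t n;

    assert(dst != NULL && src != NULL && cap > 0);
    n = strlen(src);
    if (n >= cap)
        n = cap - 1;            /* truncate to fit */
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

With the declarations above, the call would be
copy_into(msg0, sizeof msg0, msg1).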


-- 
Walker Mangum                                  |  Adytum, Incorporated
phone: (713) 333-1509                          |  1100 NASA Road One  
UUCP:  uunet!ficc!walker  (walker@ficc.uu.net) |  Houston, TX  77058
Disclaimer: $#!+ HAPPENS

henry@utzoo.uucp (Henry Spencer) (02/09/89)

In article <19742@uflorida.cis.ufl.EDU> thoth@beach.cis.ufl.edu () writes:
>...  The question is, what does the standard say about
>structure and array constants (and arrays of structure
>constants, etc.)?

Absolutely nothing.  The idea was, as I understand it, proposed a number
of times, unsuccessfully.

>I think that to not have the capability
>to declare something like that on the fly is crippled.

Alas, we'll just have to go on using our crippled language (in which many
millions of lines of code have been written quite successfully).

>If
>it isn't in the standard, somebody get it in there.  If it
>IS in the standard, somebody get GNU to add it to their
>compiler...

You have things in the wrong order:  ANSI standards committees are in the
business of standardizing ideas that have already been tried out and found
to be workable.  (There are a number of things in ANSI C that were never
tried in *Unix* C compilers, but very few things that haven't been tried
somewhere by somebody in some C compiler.)  Designing a language -- or even
a single language feature -- is tricky, and feedback from real use of real
implementations is important.  The alternative, having a standards committee
get the bit between its teeth and design something out of thin air, tends
to yield most unpleasant results.  (X3J11's one major foray in that direction
was the disastrous "noalias", fortunately since deleted.)

I think such a thing already exists in the GNU compiler, actually, and if
experience with it is sufficiently positive, it might perhaps end up in the
next revision of the standard.  It's much too late now to mess with the
about-to-be-issued current version.
-- 
Allegedly heard aboard Mir: "A |     Henry Spencer at U of Toronto Zoology
toast to comrade Van Allen!!"  | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

mcdonald@uxe.cso.uiuc.edu (02/10/89)

>You have things in the wrong order:  ANSI standards committees are in the
>business of standardizing ideas that have already been tried out and found
>to be workable. 
Bullshit!

Proof by counterexample: X3J3


>The alternative, having a standards committee
>get the bit between its teeth and design something out of thin air, tends
>to yield most unpleasant results.
Quite true.

Again, proof by X3J3. And what about X3J11's trigraphs?

henry@utzoo.uucp (Henry Spencer) (02/11/89)

In article <225800126@uxe.cso.uiuc.edu> mcdonald@uxe.cso.uiuc.edu writes:
>>You have things in the wrong order:  ANSI standards committees are in the
>>business of standardizing ideas that have already been tried out and found
>>to be workable. 
>
>Bullshit!
>Proof by counterexample: X3J3

I didn't say that they always *stuck* to what their business was supposed
to be.  Sometimes they don't, and the results are icky.  Actually, my
recollection is that for Fortran 77, X3J3 kept fairly close to things
that had already been tried out in various preprocessors or variant
implementations; it's only since then that they've gone off the deep end.
If you want a *really* bad example, I'm told that ANSI BASIC is noteworthy.

>... And what about X3J11's trigraphs?

If you look at the wording of my posting, you'll see that I didn't say
X3J11 had kept entirely to existing practice, just mostly.  (And despite
all the screaming about them, the fact is that trigraphs are a minor
nuisance to implement and are most unlikely to ever bother users much.
Note that X3J11 has explicitly rejected proposals, even one backed by
a threat of ISO disapproval, to make trigraphs more elaborate.  I'm not
in love with X3J11 trigraphs, but the issue isn't worth all the fuss that's
been made about it.  At worst they are a minor and fairly benign mistake.)
-- 
The Earth is our mother;       |     Henry Spencer at U of Toronto Zoology
our nine months are up.        | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

bright@Data-IO.COM (Walter Bright) (02/14/89)

In article <1989Feb10.191041.12109@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>I'm not
>in love with X3J11 trigraphs, but the issue isn't worth all the fuss that's
>been made about it.  At worst they are a minor and fairly benign mistake.)

The trouble with trigraphs is that they, along with the 'phases of
translation' rules, require an extra test for each character of source
read. A perfect scanner examines each source character exactly once, thus
we can measure the perfection of a scanner by how many times, on average,
each character is tested. A good scanner is about 1.1 on this scale.
Trigraphs push it over 2.0.

The reason this is a problem is because most of the time spent in a
compiler is in the reading of source text and splitting it into tokens.
(If this isn't so, then your symbol table implementation is botched or
something else is.) I've found in my compiler (Zortech) that ONE extra
instruction executed per char read slows down the compiler by 5 to 10%.

It's irritating to have to implement a feature that nobody in their right
mind is going to use, and that has such a negative impact on the product.

So who cares about 10%? I do, a large percentage of my life is spent
waiting for compiles. My customers do too, it's a big issue for them.
Besides, 10% here, 10% there, 15% somewhere else, and it adds up to
a pig for a compiler. Programs are made fast by squeezing everywhere
possible (yes, I use profilers).

To digress for a moment, I'm well aware of the rule that 90% of the
execution time is spent in 10% of the code. This is true, however, of
programs BEFORE profiling and fixing of that 10% occurs. Things flatten
out a lot after that.

henry@utzoo.uucp (Henry Spencer) (02/15/89)

In article <1875@dataio.Data-IO.COM> bright@dataio.Data-IO.COM (Walter Bright) writes:
>The trouble with trigraphs is that they, along with the 'phases of
>translation' rules, require an extra test for each character of source...
>The reason this is a problem is because most of the time spent in a
>compiler is in the reading of source text and splitting it into tokens...

A compiler that spends most of its time tokenizing source obviously isn't
working very hard at code generation.  An optimizing compiler, or even a
non-pessimizing compiler, is not going to be tokenizing-bound, unless
there's been a remarkable leap of compiler technology while I wasn't
watching.  Also:

>...(yes, I use profilers).... I'm well aware of the rule that 90% of the
>execution time is spent in 10% of the code. This is true, however, of
>programs BEFORE profiling and fixing of that 10% occurs. Things flatten
>out a lot after that.

How thoroughly flattened is your compiler?

>... I've found in my compiler (Zortech) that ONE extra
>instruction executed per char read slows down the compiler by 5 to 10%.

Hmm, that's 10-20 instructions per character, with C typically about
20 chars/line, and with even a decidedly slow machine delivering an
instruction per microsecond, gives us 2500+ lines/second.  Is your
compiler really that fast?  I'm surprised.
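
Spelling out the arithmetic: if one extra instruction per character
costs 5 to 10%, the whole compile is spending roughly 10 to 20
instructions per character.  A small sketch of the estimate, assuming
(as above) 20 characters per line, one instruction per microsecond,
and no other costs; lines_per_second is a made-up name.

```c
#include <assert.h>

/* Lines compiled per second if every source character costs
   instr_per_char instructions and nothing else takes any time. */
long lines_per_second(long instr_per_second, long instr_per_char,
                      long chars_per_line)
{
    assert(instr_per_char > 0 && chars_per_line > 0);
    return instr_per_second / (instr_per_char * chars_per_line);
}
```

At 20 instructions per character this gives 2500 lines/second, and
at 10 it gives 5000, hence the "2500+".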

>It's irritating to have to implement a feature that nobody in their right
>mind is going to use, and that has such a negative impact on the product...

If it's that big a deal, have you considered having a default "no trigraphs"
mode and a slower "trigraphs" mode?  That way, if nobody uses it, there's
no impact except for a bit of code that never gets executed.
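
A sketch of what such a switch might look like.  This is not how
Zortech or any other real compiler does it; translate and
trigraph_char are invented names, and a real scanner would fold the
test into its character-reading loop rather than copy a buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Map the third character of a trigraph to its replacement, or
   return 0 if the character does not complete one of the nine. */
static char trigraph_char(char c)
{
    static const char from[] = "=(/)'<!>-";
    static const char to[]   = "#[\\]^{|}~";
    const char *p;

    if (c == '\0')
        return 0;
    p = strchr(from, c);
    return p ? to[p - from] : 0;
}

/* Copy src to dst, replacing trigraphs only when enabled is nonzero.
   dst must have room for strlen(src) + 1 chars.  Returns the length
   of the output. */
size_t translate(char *dst, const char *src, int enabled)
{
    size_t n = 0;

    assert(dst != NULL && src != NULL);
    if (!enabled) {             /* default mode: no per-character test */
        n = strlen(src);
        memcpy(dst, src, n + 1);
        return n;
    }
    while (*src != '\0') {
        char r;

        if (src[0] == '?' && src[1] == '?' &&
            (r = trigraph_char(src[2])) != 0) {
            dst[n++] = r;       /* three input chars become one */
            src += 3;
        } else {
            dst[n++] = *src++;
        }
    }
    dst[n] = '\0';
    return n;
}
```

The extra comparison is paid only when the caller asks for trigraph
mode; the default path is a plain copy.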
-- 
The Earth is our mother;       |     Henry Spencer at U of Toronto Zoology
our nine months are up.        | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

bright@Data-IO.COM (Walter Bright) (02/16/89)

In article <1989Feb14.161906.16138@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <1875@dataio.Data-IO.COM> bright@dataio.Data-IO.COM (Walter Bright) writes:
>A compiler that spends most of its time tokenizing source obviously isn't
>working very hard at code generation.
The optimizer is a separate pass, it's designed that way so the user has the
choice of fastcompile/slowexecute or slowcompile/fastexecute.
BTW, my code generator is very efficient (it's not table driven, it's all
ad-hoc inline stuff, and is heavily optimized).
>>... I've found in my compiler (Zortech) that ONE extra
>>instruction executed per char read slows down the compiler by 5 to 10%.
>Hmm, that's 10-20 instructions per character, with C typically about
>20 chars/line, and with even a decidedly slow machine delivering an
>instruction per microsecond, gives us 2500+ lines/second.
I tested it on my own code, which includes lots of comments, macro defs
and extern defs (all the .h files). Most of the identifiers are
relatively long. The number of lines in the .h files dwarfs the number
of lines in the .c files. None of this stuff goes through the
code generator, so my results are different from those for code which
consists mostly of expressions. The ideal is to get the speed of processing
white space, comments, and false conditionals to approximate the speed
of simply reading characters from a file.
>If it's that big a deal, have you considered having a default "no trigraphs"
>mode and a slower "trigraphs" mode?  That way, if nobody uses it, there's
>no impact except for a bit of code that never gets executed.
That's the way I decided to implement it. The main difficulty, however, with
this approach is that magazine C compiler reviewers frequently don't read
the manual, and may simply run the compiler with the default settings, and
wrongly conclude that it doesn't support trigraphs. For example, the
latest BYTE review feature list for Zortech C contains numerous errors, all
resulting from the reviewers not reading the manual.

aglew@mcdurb.Urbana.Gould.COM (02/19/89)

>The reason this is a problem is because most of the time spent in a
>compiler is in the reading of source text and splitting it into tokens.
>(If this isn't so, then your symbol table implementation is botched or
>something else is.) 

(Or your compiler doesn't do very much work optimising.)

thoth@beach.cis.ufl.edu (Robert Forsman) (02/19/89)

  Excuse me, but I think my original topic has been corrupted (note I
fixed the subject line).  OK, I learned how to use an array on the
fly.

  if (intcmp(command,(int[]){GET,TORCH,0})==0) { ...

  Cool, but it isn't in ANSI (pat on the back to GNU though).  Now
here's one that I'm sure will blow your socks off.

{
  int a,b;

  scanf("%d %d",&a,&b);
  if (intcmp(command,(int[]){a,b,0})==0) { ...

  What a nifty new razorblade!  You have an array out in writeable
data space ready to accept any values and the code writes in a, b and
0 just before it needs them.  I'd like to see this added.  Really it
would just be a compression of

 {
  int a,b,goober[3];

  scanf("%d %d",&a,&b);
  goober[0] = a;
  goober[1] = b;
  goober[2] = 0;  /* should be unnecessary up above, since it's always 0 */
  if (intcmp(command,goober)==0) { ...

Anyone want to lobby to include it?
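
For the record, this construct was later standardised: C99 calls it
a compound literal, and it accepts run-time values such as a and b.
A minimal sketch, again assuming intcmp is a strcmp analogue for
zero-terminated int strings (it is not a library function, and
matches_pair is a made-up name):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical strcmp analogue for zero-terminated "strings" of ints. */
int intcmp(const int *a, const int *b)
{
    assert(a != NULL && b != NULL);
    while (*a != 0 && *a == *b) {
        a++;
        b++;
    }
    return (*a > *b) - (*a < *b);
}

/* Returns 1 when command is exactly the two values a, b. */
int matches_pair(const int *command, int a, int b)
{
    /* C99 compound literal: an unnamed int[3] built from run-time
       values, living until the enclosing block ends. */
    return intcmp(command, (int[]){ a, b, 0 }) == 0;
}
```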


  As far as corrupting my topic goes, how did we get from array
declaration on the fly to token parsing?  (I only look at the subject
line when I'm junking a series of stupid postings or intelligent
questions outside my sphere of interest.)  I'm curious.

/* of course it's my own opinion, did you see someone else post it? */
   Just say maybe to .signature