[net.internat] Suggested ANSI C changes for international character set support

gnu@l5.uucp (John Gilmore) (10/22/85)

Something just occurred to me while thinking about how to define a
version of stdio that works with 16-bit characters.  I don't think we
should consider adding a "long char" type to the language, but a few
changes will be needed to make wide characters easy to program with.

The problem that came up is how to write strings, e.g. format strings
like those fed to printf, for printw.  (Printw ["printworld" or
"printwide"] would be a "printf" designed to handle the full
international character set.)
What is needed is a way to initialize an array of 16-bit values
(shorts) with a string, as in:

char  string[] = "The value is %d\n";
   versus
short string[] = "The value is %d\n";

This doesn't work in today's C compilers, though I haven't seen a
particular reason why it shouldn't.  I see no reason not to make

float string[] = "abc";

assign the values 97., 98., and 99., and I suggest that in general,
an initializer like "abc" be standardized as exactly equivalent to
{'a', 'b', 'c'}  for ALL types.
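
Spelled out, the proposed equivalence would look like this (only a
sketch of the proposed meaning; today's compilers reject the short and
float forms):

short s[] = "abc";   /* proposed: same as short s[] = {'a', 'b', 'c'}; */
float f[] = "abc";   /* proposed: same as float f[] = {97., 98., 99.}; */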

There is also the question of what to do about character constants in
expressions: how do you make "short" character strings instead of
"char" character strings, or indicate that '%' means a short '%' rather
than a char '%'?  (replace % with your favorite Chinese glyph.)

I suggest that a trailing letter do this, as is currently done for long
integer constants.  If  37L  works, why not  "foo"L  or  "foo"S  or
"foo"C?  Similarly,  'a'L  and  'a'S  and  'a'C, where C is the default
as now.
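
In use, the suffixes would look something like this (hypothetical
syntax, accepted by no compiler today):

short wide[]   = "foo"S;   /* a string of shorts          */
long  wider[]  = "foo"L;   /* a string of longs           */
char  narrow[] = "foo"C;   /* same as plain "foo"         */
short glyph    = 'a'S;     /* a 16-bit character constant */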

There is a slight wrinkle in saying that "abc" is equivalent to {'a',
'b', 'c'}.  In a 16-bit character set, 'a' must be the first
*character* that appears after the ", not just the first *byte*.  C
compilers which are written assuming 8-bit input characters (and
which support strings of wider characters) must run their strings
through a
conversion routine to get long values for internal use.  The conversion
routine comes from stdio, since stdio will need it for I/O.  An example:

short w[] = "chinese";

(insert 7 Chinese glyphs in place of "chinese") would be tokenized and
stored internally by the compiler as a 7-element string, so that the
resulting array <w> would be an array of 7 shorts.  Meanwhile,

char c[] = "chinese";

would be tokenized the same way, but the resulting array <c> might be
an array of 20 chars, since each of the 7 glyphs requires more than one
8-bit char to hold it.  
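
A minimal sketch of what such a conversion routine might look like,
assuming a hypothetical two-byte encoding in which a byte with the
high bit set introduces a 16-bit character (the name str_to_wide and
the encoding itself are illustrations, not anything stdio provides
today):

int str_to_wide(short *dst, char *src)
{
    int n = 0;

    /* assumes well-formed input: a lead byte is always followed by
       exactly one more byte */
    while (*src) {
        if (*src & 0x80) {
            /* lead byte: combine it with the next byte into one glyph */
            dst[n++] = ((*src & 0x7f) << 8) | (src[1] & 0xff);
            src += 2;
        } else
            dst[n++] = *src++;   /* ordinary 8-bit character */
    }
    dst[n] = 0;                  /* assumes the caller left room  */
    return n;                    /* count of characters, not bytes */
}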

This lets programs use the entire character set, either via encoded byte
strings or via true wide character processing, whichever is more convenient.
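
As a usage sketch (ASCII standing in for the glyphs, and using the
hypothetical str_to_wide above), a program that prefers wide
processing can convert an encoded string once and then handle one
glyph per array element:

#include <stdio.h>

int main(void)
{
    char  c[] = "chinese";        /* encoded byte string, as today     */
    short w[8];                   /* room for 7 glyphs plus terminator */
    int   n = str_to_wide(w, c);

    printf("%d glyphs\n", n);     /* prints: 7 glyphs */
    return 0;
}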