gnu@l5.uucp (John Gilmore) (10/22/85)
Something just occurred to me while thinking about how to define a version of stdio that works with 16-bit characters. I don't think we should consider adding a "long char" type to the language, but a few changes will need to occur to make wide characters easy to program with.

The problem that came up is how to write strings, e.g. those fed to printf or printw. (Printw ["printworld" or "printwide"] would be a "printf" designed to handle the full international character set.) What is needed is a way to initialize an array of 16-bit values (shorts) with a string, as in:

	char string[] = "The value is %d\n";
versus
	short string[] = "The value is %d\n";

This doesn't work in today's C compilers, though I haven't seen a particular reason why it shouldn't. I see no reason not to make

	float string[] = "abc";

assign the values 97., 98., and 99., and I suggest that in general, an initializer like "abc" be standardized as exactly equivalent to {'a', 'b', 'c'} for ALL types.

There is also the question of what to do about character constants in expressions: how do you make "short" character strings instead of "char" character strings, or indicate that '%' means a short '%' rather than a char '%'? (Replace % with your favorite Chinese glyph.) I suggest that a trailing letter do this, as is currently done for long integer constants. If 37L works, why not "foo"L or "foo"S or "foo"C? Similarly, 'a'L and 'a'S and 'a'C, where C is the default as now.

There is a slight wrinkle in saying that "abc" is equivalent to {'a', 'b', 'c'}. In a 16-bit character set, 'a' must be the first *character* that appears after the ", not just the first *byte*. C compilers which are written assuming 8-bit input characters (and which support strings of characters wider than 8 bits) must run their strings thru a conversion routine to get wide values for internal use. The conversion routine comes from stdio, since stdio will need it for I/O anyway.
An example:

	short w[] = "chinese";

(insert 7 Chinese letters in place of "chinese") would be tokenized and stored internally by the compiler as a 7-element string, so that the resulting array <w> would be an array of 7 shorts. Meanwhile,

	char c[] = "chinese";

would be tokenized the same way, but the resulting array <c> might be an array of 20 chars, since each of the 7 glyphs requires more than one 8-bit char to hold it. This lets programs use the entire character set, either via encoded byte strings or via true wide character processing, whichever is more convenient.