[comp.std.c] Character sets

iiit-sh@cybaswan.UUCP (Steve Hosgood) (04/29/89)

Several people have been talking about Trigraphs recently. Danes, Swedes,
Icelanders and others have discussed at length whether or not they (the
potential beneficiaries of such a scheme) actually *want* or *need* the damn
things anyway.

Now IMHO, we're seeing here the consequences of restricting the world's
computer users to a 7-bit coding system originally designed just for American
English. Surely it would be better for ANSI to scrap formally the concept of
7-bit coding and move to better things? As I understand it, the reason for
7-bits in the old days was so that the character and a parity bit would fit
into a byte. These days though, as far as I know, all ACIA chips will happily
send 8-bits and parity - though most people disable the parity anyway!


I've got an article in front of me from Scientific American in 1983(ish),
though I don't know the exact date as it's a photocopy. Anyway, it's pages
82 thru' 93 and written by Joseph D. Becker of Xerox Corporation, and is
entitled "Multilingual Word Processing". It seems a lot of work has been
done on beating the problems of handling the world's languages by means of
switching of character sets. Xerox seem chiefly interested in word processing,
but it's obvious that the same ideas could be used in E-mail, and presumably
language source-code as well.

[ ** in case you didn't see the article **
The idea is that you define 8-bit alphabets, and reserve the character 0xFF
to indicate "next byte is an alphabet identifier". This allows you to switch
from one character set to another in mid-text very easily. I get the feeling
that the alphabets are designed to have shared sections, so that the codes
0x00 thru 0x7F print the same in the 'Roman/Hebrew' set as they do in the
'Roman/Esperanto' set for instance. Obviously the several alphabets needed for
Chinese will not have any commonality with the Roman stuff though.
]
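[ A minimal sketch of how a reader might walk such a stream. This is my
own guess at the mechanics, not code from the Becker article; the marker
value and "alphabet 0 = default Roman" convention are assumptions. ]

```c
#define ALPHABET_SHIFT 0xFF   /* hypothetical "next byte names an alphabet" marker */

/* Decode a byte stream into (alphabet, code) pairs.  Returns the number
 * of character codes emitted; alphabets[] and codes[] receive the pairs. */
int decode(const unsigned char *buf, int len, int *alphabets, int *codes)
{
    int alphabet = 0;   /* assume 0 = the default Roman set */
    int n = 0, i;
    for (i = 0; i < len; i++) {
        if (buf[i] == ALPHABET_SHIFT && i + 1 < len) {
            alphabet = buf[++i];    /* next byte selects the new alphabet */
        } else {
            alphabets[n] = alphabet;
            codes[n] = buf[i];
            n++;
        }
    }
    return n;
}
```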

I don't think you'd have to go as far as switched character sets to solve the
problem of dealing with *most* of the Northern European and North American
languages. Just look at the IBM-PC character set for instance. However it
would be nice to think ahead a bit and allow for the Greeks, Russians, Chinese
and Japanese.

The result of moving in this direction would be that people with old Danish
terminals would see the unrepresentable characters on screen as trigraphs, and
would type them as such, but the trigraphs are a local product of the computer's
TTY handler. What would appear in the source-code file would be the 8-bit
Northern Europe/USA code for '{' or whatever he wanted. If someone in the USA
wanted to use a 'yen' symbol, he'd have to type a trigraph for it, which
would cause an alphabet-shift code to appear in the source file to cater for
it. Someone in Japan reading that file would just see a 'yen' symbol.

OK, well it's *far* too late for such ideas to be submitted to X3J11 now, but
did anyone mention it in the early days, *before* it was too late?
Actually, it's not an X3J11 problem if you put responsibility for trigraphs
into the TTY handler. Whose problem would it be?

-----------------------------------------------+------------------------------
Steve Hosgood BSc,                             | Phone (+44) 792 295213
Image Processing and Systems Engineer,         | Fax (+44) 792 295532
Institute for Industrial Information Technology,| Telex 48149
Innovation Centre, University of Wales, +------+ JANET: iiit-sh@uk.ac.swan.pyr
Swansea SA2 8PP                         | UUCP: ..!ukc!cybaswan.UUCP!iiit-sh
----------------------------------------+-------------------------------------
            My views are not necessarily those of my employers!
	"Traditional Japanese Theatre? Just say Noh" - not Nancy Reagan

prc@maxim.ERBE.SE (Robert Claeson) (05/03/89)

In article <373@cybaswan.UUCP>, iiit-sh@cybaswan.UUCP (Steve Hosgood) writes:

> The idea is that you define 8-bit alphabets, and reserve the character 0xFF
> to indicate "next byte is an alphabet identifier". This allows you to switch
> from one character set to another in mid-text very easily.

Actually, there's an ISO standard for this that uses the SS3 8-bit control
character. I'm afraid I can't remember the name of the standard, but AT&T plans
to use it in the SVR4 tty device driver anyway.
-- 
Robert Claeson, ERBE DATA AB, P.O. Box 77, S-175 22 Jarfalla, Sweden
Tel: +46 (0)758-202 50  Fax: +46 (0)758-197 20
EUnet:   rclaeson@ERBE.SE               uucp:   {uunet,enea}!erbe.se!rclaeson
ARPAnet: rclaeson%ERBE.SE@uunet.UU.NET  BITNET: rclaeson@ERBE.SE

gwyn@smoke.BRL.MIL (Doug Gwyn) (05/03/89)

In article <373@cybaswan.UUCP> iiit-sh@cybaswan.UUCP (Steve Hosgood) writes:
>OK, well it's *far* too late for such ideas to be submitted to X3J11 now, ...

	"It was always too late for that!"
		- Walt Kelly

>did anyone mention it in the early days, *before* it was too late?

You appear to have the C language standardization committee confused
with character code set standardization committees.  In fact there
have been numerous attempts to deal with this problem, and numerous
standards have resulted, including some ISO-sanctioned ones.  They
don't quite follow the Xerox approach you cited, but there are some
recognizable similarities, particularly in some of the Japanese work.

henry@utzoo.uucp (Henry Spencer) (05/03/89)

In article <373@cybaswan.UUCP> iiit-sh@cybaswan.UUCP (Steve Hosgood) writes:
>Several people have been talking about Trigraphs recently...
>Now IMHO, we're seeing here the consequences of restricting the world's
>computer users to a 7-bit coding system originally designed just for American
>English. Surely it would be better for ANSI to scrap formally the concept of
>7-bit coding and move to better things? ...

You've got the problem exactly backwards.  ANSI C, and most other language
projects now current, are perfectly happy to assume 8-bit character sets.
The problem is that the *complainers* have 7-bit equipment that uses a
different 7-bit standard, and *they* don't want to be forced to upgrade.
They want officially-blessed, easy-to-read ways to write ANSI C using
their own old equipment.  (What next, an ANSI C encoding for the IBM Model
26 keypunch?!?)

There is in fact a standard set of 8-bit character sets, the ISO Latin sets,
that solve this problem completely -- each one has full ASCII as a subset.
ISO Latin 1 covers essentially all the Western European languages (there
is some small problem with Welsh that slipped through by accident), in
particular.  There are standard shift sequences to reach other alphabets.
(Although shifts are an enormous pain in string manipulation, which is
why ANSI C recognizes the notion of "wide character" to deal with such
things internally as unshifted codes.)  Someday the terminals etc. will
speak ISO Latin, and that will solve this set of problems.  (Then we'll
have the oriental languages to deal with... the existing code-extension
hooks can cope in theory, but in practice it's cumbersome.)
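[To make the "wide character" point concrete: in ANSI C each wchar_t
element carries one unshifted code, so plain indexing and length counting
work where a shifted byte stream would need state. A trivial sketch, using
only Standard facilities:

```c
#include <stddef.h>

/* Count wide characters up to the terminating null.  Because each
 * element is one unshifted code, no shift-state tracking is needed;
 * the multibyte external form of the same text would require it. */
size_t wlen(const wchar_t *s)
{
    size_t n = 0;
    while (s[n] != L'\0')
        n++;
    return n;
}
```
-- moderator's illustrative aside, not part of Henry's original post style.]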
-- 
Mars in 1980s:  USSR, 2 tries, |     Henry Spencer at U of Toronto Zoology
2 failures; USA, 0 tries.      | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

iiit-sh@cybaswan.UUCP (Steve Hosgood) (05/15/89)

In article <10194@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn) writes:
>You appear to have the C language standardization committee confused
>with character code set standardization committees.  In fact there
>have been numerous attempts to deal with this problem...

Sorry about the time delay in replying to this, but our newsfeed went missing
for a week..

The point I wanted to make (and I didn't express myself very well) was that
surely the C language standardization committee has confused its brief with
that of the character set standardization committee?

In article <397@cybaswan.UUCP> Henry Spencer writes:
> You've got the problem exactly backwards.  ANSI C, and most other language
> projects now current, are perfectly happy to assume 8-bit character sets.
> The problem is that the *complainers* have 7-bit equipment that uses a
> different 7-bit standard, and *they* don't want to be forced to upgrade.
> They want officially-blessed, easy-to-read ways to write ANSI C using
> their own old equipment.  (What next, an ANSI C encoding for the IBM Model
> 26 keypunch?!?)
> 

Exactly, Henry, but again I come back to the question of whose problem are
we talking about? Typing curly brackets, pipe symbols, and hashes on 7-bit
equipment is surely a problem that is general to modern computing -
many languages use such characters, the C-shell uses them, some editors
use them in commands, etc, etc.

Henry Spencer continues:
> ...... Someday the [Danish and other] terminals etc. will
> speak ISO Latin, and that will solve this set of problems. 

Yeah, but C compilers will end up carting around trigraphs in their lexical
analysers for evermore....

Sorry to bring this up again folks, but I'm *still* unhappy. The 'UCASE' hack
to allow UN*X to work on silly old terminals was put into the TTY handler. So
I believe should this trigraph thingy.

Steve

-----------------------------------------------+------------------------------
Steve Hosgood BSc,                             | Phone (+44) 792 295213
Image Processing and Systems Engineer,         | Fax (+44) 792 295532
Institute for Industrial Information Technology,| Telex 48149
Innovation Centre, University of Wales, +------+ JANET: iiit-sh@uk.ac.swan.pyr
Swansea SA2 8PP                         | UUCP: ..!ukc!cybaswan.UUCP!iiit-sh
----------------------------------------+-------------------------------------
            My views are not necessarily those of my employers!

gwyn@smoke.BRL.MIL (Doug Gwyn) (05/18/89)

In article <442@cybaswan.UUCP> iiit-sh@cybaswan.UUCP (Steve Hosgood) writes:
>surely the C language standardization committee has confused its brief with
>that of the character set standardization committee?

Not with respect to trigraphs; far from requiring sufficient character
set support, X3J11 bent over backwards to require only the minimum
practical character set according to already extant international
standards.

However, X3J11 did mandate that the character values for '0'..'9' have
adjacent values in ascending numerical order.  That is clearly a code
set requirement, which I argued against.  The need for some way to
map digit characters to numbers and vice versa does exist, but other
means to meet this need could have been specified.  For example, my
standard application system-tailored configuration header contains
the following, which ANSI C conforming implementations must support:

/* integer (or character) arguments and value: */
#define tonumber( c )	((c) - '0')	/* convt digit char to number */
#define todigit( n )	((n) + '0')	/* convt digit number to char */

Of course this is edited as required to match the actual implementation.
For years, I had been using the portable definition

#define todigit( n )	"0123456789"[n]	/* convt digit number to char */

but I never figured out a really good portable definition for tonumber().
I would much rather X3J11 have standardized macros like these than
imposing requirements on the code set.  The X3J11 requirement can "work"
only because all known implementations happen to already meet the
requirement.  If they didn't, it would be impractical to fix them!
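(For what it's worth, a portable tonumber() *can* be built from the same
table idea as the string-indexing todigit(), at the cost of a function
call instead of a macro. This is my own sketch, not anything from Gwyn's
configuration header:

```c
#include <string.h>

/* Portable digit-char-to-number conversion: locate the character's
 * offset in the digit table, whatever values the code set assigns.
 * Returns -1 if c is not a digit character.  The '\0' guard is needed
 * because strchr() would otherwise find the terminating null. */
int tonumber(int c)
{
    static const char digits[] = "0123456789";
    const char *p = (c == '\0') ? NULL : strchr(digits, c);
    return p ? (int)(p - digits) : -1;
}
```

It relies only on the digit characters appearing in the string literal,
not on their code-set values being contiguous.)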

>The 'UCASE' hack to allow UN*X to work on silly old terminals was put
>into the TTY handler. So I believe should this trigraph thingy.

Not every system has such facilities, but I agree with your general
sentiment.  In fact I expect that some of the more enlightened
implementors will take exactly this tack to deal with practical use
of so-called "European character sets".  The new ISO code set standards
should also help.  C trigraphs should remain essentially an inter-site
code transporting aid.

iiit-sh@cybaswan.UUCP (Steve Hosgood) (05/23/89)

In article <10284@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn) writes:
>However, X3J11 did mandate that the character values for '0'..'9' have
>adjacent values in ascending numerical order.  That is clearly a code
>set requirement, which I argued against.  The need for some way to
>map digit characters to numbers and vice versa does exist, but other
>means to meet this need could have been specified.

Seems like a job for <ctype.h> to me. Interesting though, I had never
considered the possibility of non-contiguous numbers and alphabetics rearing
its head now that EBCDIC is dead (slight :-)).

>>The 'UCASE' hack to allow UN*X to work on silly old terminals was put
>>into the TTY handler. So I believe should this trigraph thingy.
>Not every system has such facilities, but I agree with your general
>sentiment.  In fact I expect that some of the more enlightened
>implementors will take exactly this tack to deal with practical use
>of so-called "European character sets".

But if this trigraph thing gets into the standard, then *all* conforming
compilers will *have* to have the code in their lexical analysers. As you
say, enlightened (:-)) implementors will probably deal with the problem in
the handler, but the compiler carries the baggage around for evermore *as well*.

>The new ISO code set standards should also help.

I certainly hope so. Presumably the C standard allows for 8-bit character sets?
Also, what about such things as allowable characters in identifiers and such
like? Just yesterday, I was writing a program where I would have liked to have
used Greek characters as identifiers. Is that sort of thing permissible?
Would 'toupper' return upper-case Epsilon if given lower-case epsilon as an
argument?

It's a tricky can of worms, and it gets worse the closer you look at it.
Steve

gwyn@smoke.BRL.MIL (Doug Gwyn) (05/27/89)

In article <456@cybaswan.UUCP> iiit-sh@cybaswan.UUCP (Steve Hosgood) writes:
>I certainly hope so. Presumably the C standard allows for 8-bit character sets?

Understatement of the decade.

>Also, what about such things as allowable characters in identifiers and such
>like? Just yesterday, I was writing a program where I would have liked to have
>used Greek characters as identifiers. Is that sort of thing permissable?

The Standard does not permit use of any character other than _, 0-9, a-z,
and A-Z in identifiers, although comments may contain just about anything.

>Would 'toupper' return upper-case Epsilon if given lower-case epsilon as an
>argument?

That's implementation- and locale-dependent.  In the default ("C")
locale, only 'a' through 'z' are mapped into different characters by
toupper().
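A quick illustration of the "C"-locale behaviour (0xE9 here is e-acute in
ISO Latin 1 -- an assumption about the host code set; the point is only
that codes outside 'a'..'z' come back unchanged in the default locale):

```c
#include <ctype.h>
#include <locale.h>

/* Apply toupper() in the default "C" locale.  Only 'a'..'z' are
 * mapped; any other value, including accented Latin-1 letters,
 * is returned unchanged.  Other locales may map more, but that
 * mapping is implementation- and locale-dependent. */
int c_locale_toupper(int c)
{
    setlocale(LC_CTYPE, "C");   /* force the default locale */
    return toupper(c);
}
```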

giguere@aries5.uucp (Eric Giguere) (05/27/89)

In article <456@cybaswan.UUCP> iiit-sh@cybaswan.UUCP (Steve Hosgood) writes:
>                                      Interesting though, I had never
>considered the possibility of non-contiguous numbers and alphabetics rearing
>its head now that EBCDIC is dead (slight :-)).

With all those IBM mainframes (you know, the ones that run VM/CMS, MVS, TSO,
etc. -- the operating systems that everyone in the papers is advertising
positions for all the time....) the demise of EBCDIC is still a looooooong
way off.  Remember that Whitesmiths is on the executive of the ANSI committee
and they market a C compiler for those same machines....

....and so do we in Waterloo C.  So programming one of those mainframes doesn't
have to be so bad since there's a real language available....

Eric Giguere                                  268 Phillip St #CL-46
For the curious: it's French ("jee-gair")     Waterloo, Ontario  N2L 6G9
Bitnet  : GIGUERE at WATCSG                   (519) 746-6565
Internet: giguere@aries5.UWaterloo.ca         "Nothing but urges from HELL!!"

rbutterworth@watmath.waterloo.edu (Ray Butterworth) (05/30/89)

In article <10331@smoke.BRL.MIL>, gwyn@smoke.BRL.MIL (Doug Gwyn) writes:
> In article <456@cybaswan.UUCP> iiit-sh@cybaswan.UUCP (Steve Hosgood) writes:
> >Also, what about such things as allowable characters in identifiers and such
> >like? Just yesterday, I was writing a program where I would have liked to have
> >used Greek characters as identifiers. Is that sort of thing permissable?
> 
> The Standard does not permit use of any character other than _, 0-9, a-z,
> and A-Z in identifiers, although comments may contain just about anything.

Note that the Standard is defined in terms of what a compiler must do
with a conforming program.  It does not dictate much about what a
compiler must do with programs that do not conform to the Standard.

In particular, it does not prevent any standard compiler from accepting
identifiers with other characters in them so long as those characters
could not legally appear in the same place in a conforming program.

e.g. An ANSI compiler could not accept "." in identifiers, since a
conforming program could have "a.b" and that must be parsed as "a . b".
But an ANSI compiler is allowed to accept, as an extension, an identifier
with an umlauted U in it, although no such program can be considered
conforming to the Standard.  (One would expect the compiler to have
an option that enables warnings about such non-standard extensions,
so that software written with German identifiers could be cleaned
up to conform to the Standard before it is distributed to other
compilers.)

henry@utzoo.uucp (Henry Spencer) (05/30/89)

In article <26621@watmath.waterloo.edu> rbutterworth@watmath.waterloo.edu (Ray Butterworth) writes:
>Note that the Standard is defined in terms of what a compiler must do
>with a conforming program.  It does not dictate much about what a
>compiler must do with programs that do not conform to the Standard.
>
>In particular, it does not prevent any standard compiler from accepting
>identifiers with other characters in them so long as those characters
>could not legally appear in the same place in a conforming program.
>
>...one would expect the compiler to have
> an option that enables warnings about such non-standard extensions...

Note the wording in 2.1.1.3, which implies (subject to the interpretation
of some of the terms) that a compiler is in fact *required* to produce at
least one warning for any input file that violates the Standard's syntax
rules or constraints.  That doesn't mean it has to refuse to compile it,
mind you.
-- 
Van Allen, adj: pertaining to  |     Henry Spencer at U of Toronto Zoology
deadly hazards to spaceflight. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

diamond@jit345.swstokyo.dec.com (Norman Diamond) (04/04/91)

In the recent exchanges between keld@login.dkuug.dk (Keld J|rn Simonsen)
and gwyn@smoke.brl.mil (Doug Gwyn), there is evidence of misunderstanding
on both sides.  Mr. Simonsen missed a subtle English ambiguity in an ANSI
response that Mr. Gwyn quotes, and went off on a tangent.  Mr. Gwyn missed
a more important technical issue, and appears to have gotten all defensive
about Mr. Simonsen's complaint instead of recognizing the technical point.

dg>THE MAPPING BETWEEN INTERNAL CODE SET VALUES AND THE ENVIRONMENT CAN,
dg>AND SHOULD, BE DEFINED BY THE C IMPLEMENTATION.

ks>This I read to have the consequences that all characters in the
ks>source C program and every input and output file SHOULD have
ks>conversions applied to all widechar strings. This could be done in the
ks>mbstowcs() and other mb/wc functions.

Even ordinary characters (non-widechars) might need conversions.

ks>Thus the internal widechar representation of 'c' and the external
ks>multibyte representation SHOULD not be the same for character sets
ks>like ISO 10646, JIS X 0208, KS C 5601 and GB 2312.
ks>At least this should hold for characters in the C character set.

It seems to me that this is exactly the reason for the mb/wc functions.
However, something is missing for ordinary characters.
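To pin down the conversion path under discussion: mbstowcs() is the
Standard's hook for mapping the external multibyte form into the internal
wide form. A minimal wrapper (the encodings on either side are entirely
implementation-defined; this only shows where the Standard says the
conversion happens):

```c
#include <stdlib.h>

/* Convert an external multibyte string to the implementation's
 * internal wide-character representation, storing at most max wide
 * characters in dst.  Returns the number of wide characters produced,
 * or -1 on an invalid multibyte sequence. */
int to_wide(wchar_t *dst, const char *src, size_t max)
{
    size_t n = mbstowcs(dst, src, max);
    return (n == (size_t)-1) ? -1 : (int)n;
}
```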

ks>ANSI and DS (Danish Standards, the Danish equivalent to ANSI in ISO)
ks>have disagreed on the necessity of a *readable* and *writeable* alternative
ks>to a representation of C source in invariant ISO 646.

Exactly.

dg>Indeed, X3J11 explained this to you in the third public review response
dg>document.

It did not.

dg>	In response to Letter #177, Doc. No. X3J11/88-134:
dg>	Summary of issue:
dg>		Proposal for more readable supplement to trigraphs.

Confusion of issue.

dg>	X3J11 response:
[...]
dg>	Trigraphs were intended to provide a universally portable format
dg>	for the *transmission* of C programs; they were never intended
dg>	to be used for day-to-day reading and writing of programs.

Exactly.  However, a readable and writable alternative is ALSO necessary.
By co-incidence (well, not entirely, but can be treated that way), it
would also make trigraphs unnecessary.  A readable and writable alternative
is necessary for programming, regardless of whether transmission is done.

[...]
dg>		Translation phase 1 actually consists of two parts, first
dg>	the mapping (about which we say very little) from the external
dg>	host character set to the C source character set, then the
dg>	replacement of C source trigraph sequences with single C source
dg>	characters.  (Note that the C source characters represented in
dg>	our documents in Courier font need not appear graphically the
dg>	same in the host environment, although a reasonable
dg>	implementation will make them as nearly so as possible.)
dg>	The kind of mapping you propose can in fact be done in the first
dg>	part of translation phase 1, and several such "convenience"
dg>	mappings are already common practice.  However, attempting to
dg>	standardize this mapping is outside the scope of the C Standard,
dg>	since what is appropriate may depend on the capabilities of the
dg>	specific hardware, availability of fonts, and so forth.

ks>This I read as just a story with irrelevant facts about fonts
ks>(Courier etc.).

No.  The clause about fonts was not intended to give importance to fonts
etc.  It is intended only to help identify which characters in the standard
are C source characters, as opposed to those which are English or BNF.
The main thrust of the statement is that C source characters do not have
to look like Roman letters and punctuation marks in the host environment,
though a "reasonable" implementation would do "as nearly so as possible."

(I find this use of the word "reasonable" offensive, an innuendo bordering
on racism.  Many programmers would not like programming in C on a machine
that does not have Roman letters, but that does not make the machine, or
an implementation of C thereon, unreasonable.)

ks>This X3J11 judgement is not at all "on technical issues" - all technical
ks>issues are admitted to be solveable! The decision is a political one.

If the misunderstanding was cleared up, then it was political.  However,
it is not clear from the postings in this group whether the misunderstandings
were cleared up in time.

ks>The reason why the Japanese have not seen the problem before with
ks>JIS X 0208, but first with 10646, is beyond my understanding.
ks>Maybe some Japanese could enlighten us (me!) on this?

Maybe.  This Japanese resident can make a stab in the dark which sounds
plausible:  In JIS 2.6, there was no problem with ASCII characters.
A byte which had its high bit 0 was not part of a JIS 2.6 character.
It is possible that 0208 didn't have much of a following until recently.
(I don't know 0208 at all, so have to take the words of others about the
problems that arise.)
--
Norman Diamond       diamond@tkov50.enet.dec.com
If this were the company's opinion, I wouldn't be allowed to post it.

decot@hpisod2.cup.hp.com (Dave Decot) (04/09/91)

A coded character set should be considered to be an agreement between
two or more of the following parties:

    the programming language processor that converts a particular character
    specification into a binary bit pattern

    the font-drawing (or dotmatrix-selection) software in the output
    device on which the bit pattern will ultimately be translated
    into a visual character form

    the input device from which the bit pattern will be expected to
    be generated in response to selection of that character

The solutions must thus be addressed to two or more of these areas.

Dave Decot

erik@srava.sra.co.jp (Erik M. van der Poel) (04/10/91)

Norman Diamond writes:
> In JIS 2.6, there was no problem with ASCII characters.

What is JIS 2.6?


> It is possible that 0208 didn't have much of a following until recently.

There are three versions of JIS X 0208:

	JIS X 0208-1978
	JIS X 0208-1983
	JIS X 0208-1990

In 1987(?), they had a grand reorganization of the names. Prior to the
renaming, JIS X 0208 was:

	JIS C 6226-1978
	JIS C 6226-1983

It is safe to say that 0208 has had a following for quite a few years.
-- 
Erik M. van der Poel                                      erik@sra.co.jp
Software Research Associates, Inc., Tokyo, Japan     TEL +81-3-3234-2692

harkcom@spinach.pa.yokogawa.co.jp (04/11/91)

In article <1108@sranha.sra.co.jp> erik@srava.sra.co.jp
   (Erik M. van der Poel) writes:

 =}It is safe to say that 0208 has had a following for quite a few years.

   Amongst the standards committee it seems to have had a decade+ of
following, but how many programmers actually use JIS in programming?
The capability of handling it exists in many packages, but I think it's
rarely used as an internal encoding (I think SJIS is used the most, but
I'm not an expert).

   I think Mr. Diamond's point is valid even in reference to JIS X 0208.
The ASCII characters are not a part of the standard and are handled as
ASCII...

Al