iiit-sh@cybaswan.UUCP (Steve Hosgood) (04/29/89)
Several people have been talking about trigraphs recently. Danes, Swedes, Icelanders and others have discussed at length whether or not they (the potential beneficiaries of such a scheme) actually *want* or *need* the damn things anyway.

Now IMHO, we're seeing here the consequences of restricting the world's computer users to a 7-bit coding system originally designed just for American English. Surely it would be better for ANSI to scrap formally the concept of 7-bit coding and move to better things? As I understand it, the reason for 7 bits in the old days was so that the character and a parity bit would fit into a byte. These days though, as far as I know, all ACIA chips will happily send 8 bits and parity - though most people disable the parity anyway!

I've got an article in front of me from Scientific American in 1983(ish), though I don't know the exact date as it's a photocopy. Anyway, it's pages 82 thru' 93, written by Joseph D. Becker of Xerox Corporation, and entitled "Multilingual Word Processing". It seems a lot of work has been done on beating the problems of handling the world's languages by means of switching of character sets. Xerox seem chiefly interested in word processing, but it's obvious that the same ideas could be used in e-mail, and presumably in language source code as well.

[ ** in case you didn't see the article **
The idea is that you define 8-bit alphabets, and reserve the character 0xFF to indicate "next byte is an alphabet identifier". This allows you to switch from one character set to another in mid-text very easily. I get the feeling that the alphabets are designed to have shared sections, so that the codes 0x00 thru 0x7F print the same in the 'Roman/Hebrew' set as they do in the 'Roman/Esperanto' set, for instance. Obviously the several alphabets needed for Chinese will not have any commonality with the Roman stuff, though.
]

I don't think you'd have to go as far as switched character sets to solve the problem of dealing with *most* of the Northern European and North American languages - just look at the IBM PC character set, for instance. However, it would be nice to think ahead a bit and allow for the Greeks, Russians, Chinese and Japanese.

The result of moving in this direction would be that people with old Danish terminals would see the unrepresentable characters on screen as trigraphs, and would type them as such, but the trigraphs would be a local product of the computer's TTY handler. What would appear in the source-code file would be the 8-bit Northern Europe/USA code for '{' or whatever he wanted. If someone in the USA wanted to use a 'yen' symbol, he'd have to type a trigraph for it, which would cause an alphabet-shift code to appear in the source file to cater for it. Someone in Japan reading that file would just see a 'yen' symbol.

OK, well it's *far* too late for such ideas to be submitted to X3J11 now, but did anyone mention it in the early days, *before* it was too late? Actually, it's not an X3J11 problem if you put responsibility for trigraphs into the TTY handler. Whose problem would it be?

-----------------------------------------------+------------------------------
Steve Hosgood BSc,                              | Phone (+44) 792 295213
Image Processing and Systems Engineer,          | Fax (+44) 792 295532
Institute for Industrial Information Technology,| Telex 48149
Innovation Centre, University of Wales,  +------+ JANET: iiit-sh@uk.ac.swan.pyr
Swansea SA2 8PP                          | UUCP: ..!ukc!cybaswan.UUCP!iiit-sh
-----------------------------------------+-------------------------------------
My views are not necessarily those of my employers!
"Traditional Japanese Theatre? Just say Noh" - not Nancy Reagan
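[Editorial aside: the escape-byte scheme Steve describes can be sketched in a few lines of C. This is purely illustrative - the 0xFF shift byte is from the article, but the function name `decode` and the (alphabet, code) output arrays are assumptions made here, not Xerox's actual format.]

```c
#include <stddef.h>

#define ALPHABET_SHIFT 0xFF  /* "next byte is an alphabet identifier" */

/* Decode a byte stream into (alphabet, code) pairs, tracking the
   current alphabet across mid-text shifts.  Returns the number of
   characters decoded.  Assumes alphabet 0 at the start of the text. */
size_t decode(const unsigned char *s, size_t n,
              unsigned char alpha[], unsigned char code[])
{
    unsigned char current = 0;
    size_t i, out = 0;

    for (i = 0; i < n; i++) {
        if (s[i] == ALPHABET_SHIFT && i + 1 < n) {
            current = s[++i];      /* shift: next byte names the alphabet */
        } else {
            alpha[out] = current;  /* every plain byte is interpreted in */
            code[out] = s[i];      /* whatever alphabet is current       */
            out++;
        }
    }
    return out;
}
```

Note how cheap this is for text that never shifts: a pure-ASCII file contains no 0xFF bytes and decodes to itself, which is presumably why the shared 0x00-0x7F sections matter.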
prc@maxim.ERBE.SE (Robert Claeson) (05/03/89)
In article <373@cybaswan.UUCP>, iiit-sh@cybaswan.UUCP (Steve Hosgood) writes:

> The idea is that you define 8-bit alphabets, and reserve the character 0xFF
> to indicate "next byte is an alphabet identifier". This allows you to switch
> from one character set to another in mid-text very easily.

Actually, there's an ISO standard for this that uses the SS3 8-bit control character. I'm afraid I can't remember the name of the standard, but AT&T plans to use it in the SVR4 tty device driver anyway.
-- 
Robert Claeson, ERBE DATA AB, P.O. Box 77, S-175 22 Jarfalla, Sweden
Tel: +46 (0)758-202 50      Fax: +46 (0)758-197 20
EUnet:   rclaeson@ERBE.SE               uucp:   {uunet,enea}!erbe.se!rclaeson
ARPAnet: rclaeson%ERBE.SE@uunet.UU.NET  BITNET: rclaeson@ERBE.SE
gwyn@smoke.BRL.MIL (Doug Gwyn) (05/03/89)
In article <373@cybaswan.UUCP> iiit-sh@cybaswan.UUCP (Steve Hosgood) writes:
>OK, well it's *far* too late for such ideas to be submitted to X3J11 now, ...

"It was always too late for that!" - Walt Kelly

>did anyone mention it in the early days, *before* it was too late?

You appear to have the C language standardization committee confused with character code set standardization committees. In fact there have been numerous attempts to deal with this problem, and numerous standards have resulted, including some ISO-sanctioned ones. They don't quite follow the Xerox approach you cited, but there are some recognizable similarities, particularly in some of the Japanese work.
henry@utzoo.uucp (Henry Spencer) (05/03/89)
In article <373@cybaswan.UUCP> iiit-sh@cybaswan.UUCP (Steve Hosgood) writes:
>Several people have been talking about Trigraphs recently...
>Now IMHO, we're seeing here the consequences of restricting the world's
>computer users to a 7-bit coding system originally designed just for American
>English. Surely it would be better for ANSI to scrap formally the concept of
>7-bit coding and move to better things? ...

You've got the problem exactly backwards. ANSI C, and most other language projects now current, are perfectly happy to assume 8-bit character sets. The problem is that the *complainers* have 7-bit equipment that uses a different 7-bit standard, and *they* don't want to be forced to upgrade. They want officially-blessed, easy-to-read ways to write ANSI C using their own old equipment. (What next, an ANSI C encoding for the IBM Model 26 keypunch?!?)

There is in fact a standard set of 8-bit character sets, the ISO Latin sets, that solve this problem completely -- each one has full ASCII as a subset. ISO Latin 1 covers essentially all the Western European languages (there is some small problem with Welsh that slipped through by accident), in particular. There are standard shift sequences to reach other alphabets. (Although shifts are an enormous pain in string manipulation, which is why ANSI C recognizes the notion of "wide character" to deal with such things internally as unshifted codes.)

Someday the terminals etc. will speak ISO Latin, and that will solve this set of problems. (Then we'll have the oriental languages to deal with... the existing code-extension hooks can cope in theory, but in practice it's cumbersome.)
-- 
Mars in 1980s: USSR, 2 tries,  | Henry Spencer at U of Toronto Zoology
2 failures; USA, 0 tries.      | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
iiit-sh@cybaswan.UUCP (Steve Hosgood) (05/15/89)
In article <10194@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn) writes:
>You appear to have the C language standardization committee confused
>with character code set standardization committees. In fact there
>have been numerous attempts to deal with this problem...

Sorry about the time delay in replying to this, but our newsfeed went missing for a week. The point I wanted to make (and I didn't express myself very well) was that surely the C language standardization committee has confused its brief with that of the character set standardization committee?

In article <397@cybaswan.UUCP> Henry Spencer writes:
> You've got the problem exactly backwards. ANSI C, and most other language
> projects now current, are perfectly happy to assume 8-bit character sets.
> The problem is that the *complainers* have 7-bit equipment that uses a
> different 7-bit standard, and *they* don't want to be forced to upgrade.
> They want officially-blessed, easy-to-read ways to write ANSI C using
> their own old equipment. (What next, an ANSI C encoding for the IBM Model
> 26 keypunch?!?)

Exactly, Henry, but again I come back to the question of whose problem we are talking about. Typing curly brackets, pipe symbols, and hashes on 7-bit equipment is surely a problem that is general to modern computing - many languages use such characters, the C shell uses them, some editors use them in commands, etc., etc.

Henry Spencer continues:
> ...... Someday the [Danish and other] terminals etc. will
> speak ISO Latin, and that will solve this set of problems.

Yeah, but C compilers will end up carting around trigraphs in their lexical analysers for evermore...

Sorry to bring this up again folks, but I'm *still* unhappy. The 'UCASE' hack to allow UN*X to work on silly old terminals was put into the TTY handler. So, I believe, should this trigraph thingy.
Steve
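[Editorial aside: the trigraph replacement being argued over - whether it lives in the TTY handler or the compiler's lexer - amounts to a small filter over the nine ANSI sequences. A minimal sketch; the function name `untrigraph` is illustrative, not from any implementation discussed here.]

```c
#include <stddef.h>
#include <string.h>

/* Replace the nine ANSI C trigraph sequences in a string, in place.
   ??=  ??(  ??/  ??)  ??'  ??<  ??>  ??!  ??-
    #    [    \    ]    ^    {    }    |    ~
   Returns the new length. */
size_t untrigraph(char *s)
{
    static const char from[] = "=(/)'<>!-";
    static const char to[]   = "#[\\]^{}|~";
    char *r = s, *w = s;

    while (*r) {
        const char *p;
        if (r[0] == '?' && r[1] == '?' && r[2] != '\0'
            && (p = strchr(from, r[2])) != NULL) {
            *w++ = to[p - from];   /* collapse 3 bytes to 1 */
            r += 3;
        } else {
            *w++ = *r++;           /* anything else passes through */
        }
    }
    *w = '\0';
    return (size_t)(w - s);
}
```

Carried in a TTY handler, this runs once per line of input; carried in a conforming lexer, it is baggage every compiler pays for - which is exactly Steve's complaint.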
gwyn@smoke.BRL.MIL (Doug Gwyn) (05/18/89)
In article <442@cybaswan.UUCP> iiit-sh@cybaswan.UUCP (Steve Hosgood) writes:
>surely the C language standardization committee has confused its brief with
>that of the character set standardization committee?

Not with respect to trigraphs; far from requiring sufficient character set support, X3J11 bent over backwards to require only the minimum practical character set according to already extant international standards.

However, X3J11 did mandate that the character values for '0'..'9' have adjacent values in ascending numerical order. That is clearly a code set requirement, which I argued against. The need for some way to map digit characters to numbers and vice versa does exist, but other means to meet this need could have been specified. For example, my standard application system-tailored configuration header contains the following, which ANSI C conforming implementations must support:

	/* integer (or character) arguments and value: */
	#define tonumber( c )	((c) - '0')	/* convt digit char to number */
	#define todigit( n )	((n) + '0')	/* convt digit number to char */

Of course this is edited as required to match the actual implementation. For years, I had been using the portable definition

	#define todigit( n )	"0123456789"[n]	/* convt digit number to char */

but I never figured out a really good portable definition for tonumber(). I would much rather X3J11 have standardized macros like these than imposing requirements on the code set. The X3J11 requirement can "work" only because all known implementations happen to already meet the requirement. If they didn't, it would be impractical to fix them!

>The 'UCASE' hack to allow UN*X to work on silly old terminals was put
>into the TTY handler. So I believe should this trigraph thingy.

Not every system has such facilities, but I agree with your general sentiment. In fact I expect that some of the more enlightened implementors will take exactly this tack to deal with practical use of so-called "European character sets".
The new ISO code set standards should also help. C trigraphs should remain essentially an inter-site code transporting aid.
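[Editorial aside: one portable shape for the tonumber() Gwyn says he never pinned down is to search the same string literal his todigit() indexes, rather than assuming '0'..'9' are contiguous. A sketch only - not anything X3J11 blessed, and written as a function because pointer arithmetic between distinct string literals in a macro would not be portable:]

```c
#include <string.h>

/* Convert a digit character to its numeric value without assuming
   '0'..'9' have contiguous codes.  Returns -1 if c is not a digit. */
int tonumber(int c)
{
    static const char digits[] = "0123456789";
    const char *p;

    if (c == '\0')              /* strchr would "find" the terminator */
        return -1;
    p = strchr(digits, c);
    return p ? (int)(p - digits) : -1;
}
```

The cost is a search per call, which is presumably why implementations that can rely on contiguous digits prefer the `(c) - '0'` form.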
iiit-sh@cybaswan.UUCP (Steve Hosgood) (05/23/89)
In article <10284@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn) writes:
>However, X3J11 did mandate that the character values for '0'..'9' have
>adjacent values in ascending numerical order. That is clearly a code
>set requirement, which I argued against. The need for some way to
>map digit characters to numbers and vice versa does exist, but other
>means to meet this need could have been specified.

Seems like a job for <ctype.h> to me. Interesting though, I had never considered the possibility of non-contiguous numbers and alphabetics rearing its head now that EBCDIC is dead (slight :-)).

>>The 'UCASE' hack to allow UN*X to work on silly old terminals was put
>>into the TTY handler. So I believe should this trigraph thingy.
>Not every system has such facilities, but I agree with your general
>sentiment. In fact I expect that some of the more enlightened
>implementors will take exactly this tack to deal with practical use
>of so-called "European character sets".

But if this trigraph thing gets into the standard, then *all* conforming compilers will *have* to have the code in their lexical analysers. As you say, enlightened (:-)) implementors will probably deal with the problem in the handler, but the compiler carries the baggage around for evermore *as well*.

>The new ISO code set standards should also help.

I certainly hope so. Presumably the C standard allows for 8-bit character sets? Also, what about such things as allowable characters in identifiers and such like? Just yesterday, I was writing a program where I would have liked to have used Greek characters as identifiers. Is that sort of thing permissible? Would 'toupper' return upper-case Epsilon if given lower-case epsilon as an argument?

It's a tricky can of worms, and it gets worse the closer you look at it.

Steve
gwyn@smoke.BRL.MIL (Doug Gwyn) (05/27/89)
In article <456@cybaswan.UUCP> iiit-sh@cybaswan.UUCP (Steve Hosgood) writes:
>I certainly hope so. Presumably the C standard allows for 8-bit character
>sets?

Understatement of the decade.

>Also, what about such things as allowable characters in identifiers and such
>like? Just yesterday, I was writing a program where I would have liked to
>have used Greek characters as identifiers. Is that sort of thing permissible?

The Standard does not permit use of any character other than _, 0-9, a-z, and A-Z in identifiers, although comments may contain just about anything.

>Would 'toupper' return upper-case Epsilon if given lower-case epsilon as an
>argument?

That's implementation- and locale-dependent. In the default ("C") locale, only 'a' through 'z' are mapped into different characters by toupper().
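[Editorial aside: the locale point is easy to demonstrate. A minimal sketch - the wrapper name `upper_in_c_locale` is illustrative; in the default "C" locale only 'a'..'z' change, so epsilon (or any other non-ASCII letter) would come back untouched:]

```c
#include <ctype.h>
#include <locale.h>

/* Map a character through toupper() in the default "C" locale,
   where only 'a'..'z' are changed; everything else is returned
   unchanged.  A locale set up for Greek could behave differently. */
int upper_in_c_locale(int c)
{
    setlocale(LC_CTYPE, "C");
    return toupper((unsigned char)c);
}
```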
giguere@aries5.uucp (Eric Giguere) (05/27/89)
In article <456@cybaswan.UUCP> iiit-sh@cybaswan.UUCP (Steve Hosgood) writes:
>Interesting though, I had never considered the possibility of non-contiguous
>numbers and alphabetics rearing its head now that EBCDIC is dead (slight :-)).

With all those IBM mainframes (you know, the ones that run VM/CMS, MVS, TSO, etc. -- the operating systems that everyone in the papers is advertising positions for all the time...) the demise of EBCDIC is still a looooooong way off. Remember that Whitesmiths is on the executive of the ANSI committee and they market a C compiler for those same machines...

...and so do we in Waterloo C. So programming one of those mainframes doesn't have to be so bad, since there's a real language available...

Eric Giguere                                268 Phillip St #CL-46
For the curious: it's French ("jee-gair")   Waterloo, Ontario  N2L 6G9
Bitnet: GIGUERE at WATCSG                   (519) 746-6565
Internet: giguere@aries5.UWaterloo.ca
"Nothing but urges from HELL!!"
rbutterworth@watmath.waterloo.edu (Ray Butterworth) (05/30/89)
In article <10331@smoke.BRL.MIL>, gwyn@smoke.BRL.MIL (Doug Gwyn) writes:
> In article <456@cybaswan.UUCP> iiit-sh@cybaswan.UUCP (Steve Hosgood) writes:
> >Also, what about such things as allowable characters in identifiers and
> >such like? Just yesterday, I was writing a program where I would have
> >liked to have used Greek characters as identifiers. Is that sort of thing
> >permissible?
>
> The Standard does not permit use of any character other than _, 0-9, a-z,
> and A-Z in identifiers, although comments may contain just about anything.

Note that the Standard is defined in terms of what a compiler must do with a conforming program. It does not dictate much about what a compiler must do with programs that do not conform to the Standard.

In particular, it does not prevent any standard compiler from accepting identifiers with other characters in them, so long as those characters could not legally appear in the same place in a conforming program.

E.g. an ANSI compiler could not accept "." in identifiers, since a conforming program could have "a.b" and that must be parsed as "a . b". But an ANSI compiler is allowed to accept as an extension an identifier with an umlauted U in it, although no such program can be considered as conforming to the Standard. (One would expect the compiler to have an option that enables warnings about such non-standard extensions, so that software written with German identifiers could be cleaned up to conform to the Standard before it is distributed to other compilers.)
henry@utzoo.uucp (Henry Spencer) (05/30/89)
In article <26621@watmath.waterloo.edu> rbutterworth@watmath.waterloo.edu (Ray Butterworth) writes:
>Note that the Standard is defined in terms of what a compiler must do
>with a conforming program. It does not dictate much about what a
>compiler must do with programs that do not conform to the Standard.
>
>In particular, it does not prevent any standard compiler from accepting
>identifiers with other characters in them so long as those characters
>could not legally appear in the same place in a conforming program.
>
>...one would expect the compiler to have
>an option that enables warnings about such non-standard extensions...

Note the wording in 2.1.1.3, which implies (subject to the interpretation of some of the terms) that a compiler is in fact *required* to produce at least one warning for any input file that violates the Standard's syntax rules or constraints. That doesn't mean it has to refuse to compile it, mind you.
-- 
Van Allen, adj: pertaining to  | Henry Spencer at U of Toronto Zoology
deadly hazards to spaceflight. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
diamond@jit345.swstokyo.dec.com (Norman Diamond) (04/04/91)
In the recent exchanges between keld@login.dkuug.dk (Keld J|rn Simonsen) and gwyn@smoke.brl.mil (Doug Gwyn), there is evidence of misunderstanding on both sides. Mr. Simonsen missed a subtle English ambiguity in an ANSI response that Mr. Gwyn quotes, and went off on a tangent. Mr. Gwyn missed a more important technical issue, and appears to have gotten all defensive about Mr. Simonsen's complaint instead of recognizing the technical point.

dg>THE MAPPING BETWEEN INTERNAL CODE SET VALUES AND THE ENVIRONMENT CAN,
dg>AND SHOULD, BE DEFINED BY THE C IMPLEMENTATION.

ks>This I read to have the consequences that all characters in the
ks>source C program and every input and output file SHOULD have
ks>conversions applied to all widechar strings. This could be done in the
ks>mbstowcs() and other mb/wc functions.

Even ordinary characters (non-widechars) might need conversions.

ks>Thus the internal widechar representation of 'c' and the external
ks>multibyte representation SHOULD not be the same for character sets
ks>like ISO 10646, JIS X 0208, KS C 5601 and GB 2312.
ks>At least this should hold for characters in the C character set.

It seems to me that this is exactly the reason for the mb/wc functions. However, something is missing for ordinary characters.

ks>ANSI and DS (Danish Standards, the Danish equivalent to ANSI in ISO)
ks>have disagreed on the necessity of a *readable* and *writeable* alternative
ks>to a representation of C source in invariant ISO 646.

Exactly.

dg>Indeed, X3J11 explained this to you in the third public review response
dg>document.

It did not.

dg> In response to Letter #177, Doc. No. X3J11/88-134:
dg> Summary of issue:
dg>   Proposal for more readable supplement to trigraphs.

Confusion of issue.

dg> X3J11 response: [...]
dg> Trigraphs were intended to provide a universally portable format
dg> for the *transmission* of C programs; they were never intended
dg> to be used for day-to-day reading and writing of programs.

Exactly.
However, a readable and writable alternative is ALSO necessary. By coincidence (well, not entirely, but it can be treated that way), it would also make trigraphs unnecessary. A readable and writable alternative is necessary for programming, regardless of whether transmission is done.

[...]

dg> Translation phase 1 actually consists of two parts, first
dg> the mapping (about which we say very little) from the external
dg> host character set to the C source character set, then the
dg> replacement of C source trigraph sequences with single C source
dg> characters. (Note that the C source characters represented in
dg> our documents in Courier font need not appear graphically the
dg> same in the host environment, although a reasonable
dg> implementation will make them as nearly so as possible.)
dg>
dg> The kind of mapping you propose can in fact be done in the first
dg> part of translation phase 1, and several such "convenience"
dg> mappings are already common practice. However, attempting to
dg> standardize this mapping is outside the scope of the C Standard,
dg> since what is appropriate may depend on the capabilities of the
dg> specific hardware, availability of fonts, and so forth.

ks>This I read as just a story with irrelevant facts about fonts
ks>(Courier etc.).

No. The clause about fonts was not intended to give importance to fonts etc. It is intended only to help identify which characters in the standard are C source characters, as opposed to those which are English or BNF. The main thrust of the statement is that C source characters do not have to look like Roman letters and punctuation marks in the host environment, though a "reasonable" implementation would do so "as nearly as possible."

(I find this use of the word "reasonable" offensive, an innuendo bordering on racism. Many programmers would not like programming in C on a machine that does not have Roman letters, but that does not make the machine, or an implementation of C thereon, unreasonable.)
ks>This X3J11 judgement is not at all "on technical issues" - all technical
ks>issues are admitted to be solvable! The decision is a political one.

If the misunderstanding was cleared up, then it was political. However, it is not clear from the postings in this group whether the misunderstandings were cleared up in time.

ks>The reason why the Japanese have not seen the problem before with
ks>JIS X 0208, but first with 10646, is beyond my understanding.
ks>Maybe some Japanese could enlighten us (me!) on this?

Maybe. This Japanese resident can make a stab in the dark which sounds plausible: In JIS 2.6, there was no problem with ASCII characters. A byte which had its high bit 0 was not part of a JIS 2.6 character. It is possible that 0208 didn't have much of a following until recently. (I don't know 0208 at all, so have to take the words of others about the problems that arise.)
-- 
Norman Diamond       diamond@tkov50.enet.dec.com
If this were the company's opinion, I wouldn't be allowed to post it.
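[Editorial aside: the mb/wc conversion functions at the heart of this exchange are easy to sketch. The wrapper name `widen` is illustrative; the mapping performed by mbstowcs() is locale-dependent, and in the default "C" locale each byte simply maps to one wide character - the interesting conversions Simonsen wants happen only in locales for code sets like JIS X 0208:]

```c
#include <stdlib.h>
#include <locale.h>
#include <stddef.h>

/* Convert an external multibyte string to the internal wide-character
   form.  Returns the number of wide characters produced (not counting
   the terminator), or (size_t)-1 on an invalid sequence.  Forces the
   default "C" locale here so the behavior is predictable. */
size_t widen(wchar_t *dst, const char *src, size_t n)
{
    setlocale(LC_CTYPE, "C");
    return mbstowcs(dst, src, n);
}
```

A real program would instead call setlocale(LC_CTYPE, "") once at startup, picking up the environment's locale so that the external encoding is whatever the host actually uses.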
decot@hpisod2.cup.hp.com (Dave Decot) (04/09/91)
A coded character set should be considered to be an agreement between two or more of the following parties:

    the programming language processor that converts a particular
    character specification into a binary bit pattern

    the font-drawing (or dot-matrix-selection) software in the output
    device on which the bit pattern will ultimately be translated into
    a visual character form

    the input device from which the bit pattern will be expected to be
    generated in response to selection of that character

The solutions must thus be addressed to two or more of these areas.

Dave Decot
erik@srava.sra.co.jp (Erik M. van der Poel) (04/10/91)
Norman Diamond writes:
> In JIS 2.6, there was no problem with ASCII characters.

What is JIS 2.6?

> It is possible that 0208 didn't have much of a following until recently.

There are three versions of JIS X 0208:

    JIS X 0208-1978
    JIS X 0208-1983
    JIS X 0208-1990

In 1987(?), they had a grand reorganization of the names. Prior to the renaming, JIS X 0208 was:

    JIS C 6226-1978
    JIS C 6226-1983

It is safe to say that 0208 has had a following for quite a few years.
-- 
Erik M. van der Poel      erik@sra.co.jp
Software Research Associates, Inc., Tokyo, Japan     TEL +81-3-3234-2692
harkcom@spinach.pa.yokogawa.co.jp (04/11/91)
In article <1108@sranha.sra.co.jp> erik@srava.sra.co.jp (Erik M. van der Poel) writes:
=}It is safe to say that 0208 has had a following for quite a few years.

Amongst the standards committee it seems to have had a decade+ of following, but how many programmers actually use JIS in programming? The capability of handling it exists in many packages, but I think it's rarely used as an internal encoding (I think SJIS is used the most, but I'm not an expert).

I think Mr. Diamond's point is valid even in reference to JIS X 0208. The ASCII characters are not a part of the standard and are handled as ASCII...

Al