dan@sics.se (Dan Sahlin) (09/08/87)
Being a member of a working group for character standardisation, I would like to point out that there is almost always already a standard for most needs. There is not much of a need to invent more (even if we are working on that too!).

Most of all, there is a standardised way of switching between coding standards, called ISO 2022. The European Computer Manufacturers Association (ECMA) has registered about 100 character sets according to ISO 2022 and the standard for registering new standards (!), ISO 2375. Since among these we find the various 8859 versions, there is a standardised way to switch between them. You will also find the Arabic, Hebrew and Cyrillic character sets in the ECMA register.

Best of all, at least in Europe, you will get the full set of registered characters for free if you write to Registration Authority for ISO 2375, ECMA, Rue du Rhone 114, CH-1204 GENEVA, Switzerland. Ask for the current full set of pages of the "International Register". You may also become an "owner of the international register", which means that you get a copy of all newly registered character sets.

Please note that the character sets based on 6937 are not registered by ECMA (as far as I know).

Personally I vote for 8859 immediately, 6937 thereafter and finally multi-octet coding!

Dan Sahlin (dan@sics.uucp)
andersa@kuling.UUCP (Anders Andersson) (09/15/87)
In article <1498@sics.se> dan@sics.se (Dan Sahlin) writes:
>Most of all, there is a standardised way of switching between coding
>standards called ISO 2022. The European Computer Manufacturers Association
>(ECMA) has registered about 100 character sets according to ISO 2022 and
>the standard for registering new standards (!) ISO 2375.
>Since among these, we find the various 8859 versions, there is a
>standardised way to switch between them. You will also find the Arabic,
>Hebrew and Cyrillic character sets in the ECMA register.

Yes, there are escape sequences for selecting any set, and I agree that this is an appropriate way to represent sequential data, but what if you want to access portions of the text randomly in memory?

Is the programmer supposed to search the file from the beginning for the latest set switch when he wants to know whether the byte 68 in position 4711 is a capital E with grave accent (as in registration 123) or the second byte in a small Greek delta (as in registration 58)?

Is it the programmer's task to provide for efficiency by creating his own internal data structures, or will ISO 2022 eventually be more specific on implementation demands (such as repeating the escape sequence within certain intervals, although there is no change in character set)?

For instance, ISO 2022 doesn't tell me how to interpret the first byte of a file, unless it's an escape sign. Have I missed some page of their documentation?
--
Anders Andersson, Dept. of Computer Systems, Uppsala University, Sweden
Phone: +46 18 183170
UUCP: andersa@kuling.UUCP (...!{seismo,mcvax}!enea!kuling!andersa)
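The random-access cost raised in this post can be made concrete with a small sketch. It is a deliberately simplified model, not an ISO 2022 implementation: it assumes an escape sequence is ESC, then any number of intermediate bytes (0x20-0x2F), then one final byte (0x30-0x7E), and it treats the whole sequence as an opaque "current set" marker, ignoring the G0-G3 distinction and the shift functions. The function name `designation_at` and the sample escape sequences are illustrative choices only.

```python
ESC = 0x1B  # the escape character that opens every escape sequence


def designation_at(data: bytes, pos: int, default=None):
    """Return the last escape sequence that precedes data[pos].

    Simplified model (an assumption, see the text above): the stream has no
    index, so the only way to answer is to scan from the first byte up to
    pos, remembering the most recent escape sequence. Cost is O(pos).
    """
    current = default
    i = 0
    while i < pos:
        if data[i] == ESC:
            j = i + 1
            # Skip intermediate bytes (0x20-0x2F).
            while j < len(data) and 0x20 <= data[j] <= 0x2F:
                j += 1
            # Consume the final byte (0x30-0x7E), if present.
            if j < len(data) and 0x30 <= data[j] <= 0x7E:
                j += 1
            current = data[i:j]
            i = j
        else:
            i += 1
    return current


# Hypothetical sample: plain text, a switch, one high byte, more text,
# another switch, one more high byte. The two escape sequences are used
# here purely as opaque markers.
SET_A = b"\x1b-A"
SET_B = b"\x1b-F"
sample = b"abc" + SET_A + b"\xc8xyz" + SET_B + b"\xe4"

if __name__ == "__main__":
    for position in (0, 6, 13):
        print(position, designation_at(sample, position))
```

Nothing in the byte at a given position says which set it belongs to, so the same byte value can only be interpreted after the scan. Avoiding that scan means keeping a side table of switch positions, which is exactly the "own internal data structures" the post asks about.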
sommar@enea.UUCP (Erland Sommarskog) (09/20/87)
andersa@kuling.UUCP (Anders Andersson) writes:
>In article <1498@sics.se> dan@sics.se (Dan Sahlin) writes:
>>Most of all, there is a standardised way of switching between coding
>>standards called ISO 2022. The European Computer Manufacturers Association
>>(ECMA) has registered about 100 character sets according to ISO 2022 and
>>the standard for registering new standards (!) ISO 2375.
>>Since among these, we find the various 8859 versions, there is a
>>standardised way to switch between them. You will also find the Arabic,
>>Hebrew and Cyrillic character sets in the ECMA register.
>
>Yes, there are escape sequences for selecting any set, and I agree that
>this is an appropriate way to represent sequential data, but what if you
>want to access portions of the text randomly in memory?

Anders' objection is very valid, I think. The use of escape sequences doesn't make mixing of letters from different alphabets easy.

Another example is a compiler. What characters should it accept as part of identifier names? All letters and numbers, doesn't that seem reasonable? But with all these sets it is difficult. Take the four 8859 Latin versions. Latin 1 has letters at code 192 and upwards, whereas Latin 2-4 all have letters below 192 too. So the compiler must know the escape sequences and all the standards. Of course it is possible to implement, but somehow I think that compiler writers are too lazy for that. (And I wouldn't be surprised if someone found use for a character in the range 160..192 from Latin 1 as a special character, that character being a letter in Latin 2-4.)

As an example, VAX-pascal supports the DEC multinational character set (which is based on an old draft of Latin 1), at least it says so in the manual. But what happens if you try to use a letter from the upper half as part of an identifier? "Illegal ASCII character". Ridiculous!
--
Erland Sommarskog
ENEA Data, Stockholm
sommar@enea.UUCP
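The identifier problem described above can be illustrated with a short sketch: whether a byte is a letter depends on which 8859 part is in force. The sketch makes two assumptions that are not from the post. It uses Python's present-day `iso8859_1` to `iso8859_4` codec tables, which may differ in detail from the 1987 texts, and it takes Unicode's `str.isalpha()` as the definition of "letter". The name `is_letter` and the `CHARSETS` table are hypothetical.

```python
# Map the names used in the post to Python's standard codec names.
CHARSETS = {
    "Latin 1": "iso8859_1",
    "Latin 2": "iso8859_2",
    "Latin 3": "iso8859_3",
    "Latin 4": "iso8859_4",
}


def is_letter(byte: int, charset: str) -> bool:
    """Report whether a single byte is a letter in the given 8859 part.

    Assumption: "letter" means str.isalpha() on the decoded character.
    Bytes that a part leaves undefined are treated as non-letters.
    """
    try:
        ch = bytes([byte]).decode(CHARSETS[charset])
    except UnicodeDecodeError:
        return False
    return ch.isalpha()


if __name__ == "__main__":
    # Byte 161 lies in the 160..192 range mentioned in the post.
    for name in CHARSETS:
        print(name, is_letter(0xA1, name))
```

With these tables, byte 161 is a punctuation mark in Latin 1 but a capital letter in Latin 2, 3 and 4, so a lexer that classifies bytes without knowing the active set cannot decide whether it may appear in an identifier.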