[comp.std.internat] Please use the standards!

dan@sics.se (Dan Sahlin) (09/08/87)

Being a member of a working group for character standardisation, I would
like to point out that there is almost always a standard already for most
needs. There is not much need to invent more (even if we are working
on that too!).

Most of all, there is a standardised way of switching between coding
standards called ISO 2022. The European Computer Manufacturers Association
(ECMA) has registered about 100 character sets according to ISO 2022 and
the standard for registering new standards (!), ISO 2375.
Since among these, we find the various 8859 versions, there is a
standardised way to switch between them.  You will also find the Arabic,
Hebrew and Cyrillic character sets in the ECMA register.
Best of all, at least in Europe, you will get the full set of registered
characters for free if you write to the Registration Authority for ISO 2375,
ECMA, Rue du Rhone 114, CH-1204 GENEVA, Switzerland. Ask for the
current full set of pages of the "International Register". You may also become
an "owner of the international register" which means that you get a copy
of all newly registered character sets.

Please note that the character sets based on 6937 are not registered by
ECMA (as far as I know).

Personally I vote for 8859 immediately, 6937 thereafter and finally multi-
octet coding!

	Dan Sahlin      (dan@sics.uucp)

andersa@kuling.UUCP (Anders Andersson) (09/15/87)

In article <1498@sics.se> dan@sics.se (Dan Sahlin) writes:
>Most of all, there is a standardised way of switching between coding
>standards called ISO 2022. The European Computer Manufacturers Association
>(ECMA) has registered about 100 character sets according to ISO 2022 and
>the standard for registering new standards (!), ISO 2375.
>Since among these, we find the various 8859 versions, there is a
>standardised way to switch between them.  You will also find the Arabic,
>Hebrew and Cyrillic character sets in the ECMA register.

Yes, there are escape sequences for selecting any set, and I agree that
this is an appropriate way to represent sequential data, but what if you
want to access portions of the text randomly in memory? Is the programmer
supposed to search the file from the beginning for the latest set switch
when he wants to know whether the byte 68 in position 4711 is a capital E
with grave accent (as in registration 123) or the second byte in a small
Greek delta (as in registration 58)? Is it the programmer's task to
provide for efficiency by creating his own internal data structures, or
will ISO 2022 eventually be more specific on implementation demands (such
as repeating the escape sequence within certain intervals, although there
is no change in character set)? For instance, ISO 2022 doesn't tell me
how to interpret the first byte of a file, unless it's an escape character.
Have I missed some page of their documentation?
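
To make the random-access problem concrete, here is a minimal Python sketch
under a deliberately simplified model. It assumes that a set switch is the
three-byte sequence ESC 2/13 F (ESC, "-", final byte), with final bytes "A"
and "B" taken to select the right-hand parts of Latin 1 and Latin 2. The
table and all function names are illustrative assumptions, and the sketch
ignores the rest of ISO 2022 (G0..G3, shift functions, multi-byte sets). It
shows both the linear scan from the start of the file and the kind of
home-made index a programmer would have to build to avoid it.

```python
import bisect

ESC = 0x1B

# Assumed, illustrative table: final byte of "ESC - F" -> set name.
FINAL_TO_SET = {0x41: "Latin 1", 0x42: "Latin 2"}

def switches(data):
    """Yield (offset just after the sequence, set name) for each switch."""
    i = 0
    while i + 2 < len(data):
        if data[i] == ESC and data[i + 1] == 0x2D and data[i + 2] in FINAL_TO_SET:
            yield i + 3, FINAL_TO_SET[data[i + 2]]
            i += 3
        else:
            i += 1

def active_set_scan(data, pos):
    """Scan from the beginning to find which set governs the byte at pos."""
    current = None  # no switch seen yet: the interpretation is left open
    for start, name in switches(data):
        if start > pos:
            break
        current = name
    return current

def build_index(data):
    """One pass over the data; afterwards lookups need no rescanning."""
    found = list(switches(data))
    return [start for start, _ in found], [name for _, name in found]

def active_set_indexed(index, pos):
    """Binary search in the precomputed switch offsets."""
    offsets, names = index
    k = bisect.bisect_right(offsets, pos)
    return names[k - 1] if k else None

# The same byte 0xC8 appears twice, governed by two different sets.
data = b"ab" + b"\x1b-A" + b"\xc8" + b"\x1b-B" + b"\xc8"
index = build_index(data)
for pos in (0, 5, 9):
    print(pos, active_set_scan(data, pos), active_set_indexed(index, pos))
```

The scan costs time proportional to the position asked about, whereas the
index costs one full pass plus extra memory. Either way the work falls on
the program; nothing in this model comes from the stored text itself. The
None result for position 0 is the "first byte of a file" case.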
-- 
Anders Andersson, Dept. of Computer Systems, Uppsala University, Sweden
Phone: +46 18 183170
UUCP: andersa@kuling.UUCP (...!{seismo,mcvax}!enea!kuling!andersa)

sommar@enea.UUCP (Erland Sommarskog) (09/20/87)

andersa@kuling.UUCP (Anders Andersson) writes:
>In article <1498@sics.se> dan@sics.se (Dan Sahlin) writes:
>>Most of all, there is a standardised way of switching between coding
>>standards called ISO 2022. The European Computer Manufacturers Association
>>(ECMA) has registered about 100 character sets according to ISO 2022 and
>>the standard for registering new standards (!), ISO 2375.
>>Since among these, we find the various 8859 versions, there is a
>>standardised way to switch between them.  You will also find the Arabic,
>>Hebrew and Cyrillic character sets in the ECMA register.
>
>Yes, there are escape sequences for selecting any set, and I agree that
>this is an appropriate way to represent sequential data, but what if you
>want to access portions of the text randomly in memory? 

Anders' objection is quite valid, I think. The use of escape sequences
doesn't make mixing of letters from different alphabets easy.
  Another example is a compiler. What characters should it accept as
part of identifier names? All letters and digits; doesn't that seem
reasonable? But with all these sets it is difficult. Take the four
8859 Latin versions. Latin 1 has letters at codes 192 and upwards, whereas
Latin 2-4 all have letters below 192 too. So the compiler must know
the escape sequences and all the standards. Of course it is possible to 
implement, but somehow I think that compiler writers are too lazy for that. 
(And I wouldn't be surprised if someone found use for a character in the
range 160..192 from Latin 1 as a special character, that character being
a letter in Latin 2-4.)
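
The dependence of "is this byte a letter?" on the active set can be
sketched in a few lines of Python. This is only an illustration: it relies
on Python's built-in iso8859_1 .. iso8859_4 codecs and on Unicode's notion
of a letter (str.isalpha), both of which reflect the standards as finally
published rather than any draft, and the helper names are made up for the
example. Note that Unicode also classes a few Latin 1 symbols below 192
(the ordinal indicators and the micro sign) as alphabetic, so the lists it
prints need not match a compiler writer's idea of "letter" exactly.

```python
def is_letter(code, charset):
    """True if the byte value decodes to an alphabetic character in charset."""
    try:
        return bytes([code]).decode(charset).isalpha()
    except UnicodeDecodeError:
        return False  # position not assigned in this set

def letter_codes(charset, lo=160, hi=192):
    """Byte values in [lo, hi) that count as letters in the given set."""
    return [code for code in range(lo, hi) if is_letter(code, charset)]

# The same byte, 161, under the four Latin sets.
for charset in ("iso8859_1", "iso8859_2", "iso8859_3", "iso8859_4"):
    print(charset, is_letter(161, charset), letter_codes(charset))
```

Byte 161 is the inverted exclamation mark in Latin 1 but a capital A with
ogonek in Latin 2, so a lexer cannot classify it without knowing which set
is in force.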
  As an example, VAX-pascal supports the DEC multinational character set
(which is based on an old draft of Latin 1); at least it says so in the
manual. But what happens if you try to use a letter from the upper half
as part of an identifier? "Illegal ASCII character". Ridiculous!
-- 

Erland Sommarskog       
ENEA Data, Stockholm    
sommar@enea.UUCP