[comp.lang.eiffel] Eiffel and national character sets

sommar@enea.se (Erland Sommarskog) (07/09/89)
In one of his many interesting articles, Bertrand Meyer said that if one 
wants changes in Eiffel, the next six months is the time to ask for 
them. Since I think Eiffel has to be improved with regard to supporting 
human languages other than English, I will try to summarize the potential 
problem areas I'm aware of. I'm not an expert in either Eiffel or 
character-set standards, so I will not try to provide complete solutions, 
but merely point out the requirements.
  Before I go any further, let me add that these problems are in no
way unique to Eiffel. For some time now, ISO has required support for
multinational character sets whenever it approves or revises a
language standard. The problems connected with this vary from language
to language, as do the solutions. 
  I first give an overview of existing character sets. I then discuss
the various problem areas: operator and delimiter characters, literals, 
identifiers and finally operations such as comparisons.

Standardized character sets
---------------------------
There are several character-set standards, which I briefly describe
since not everyone may be well acquainted with them.
ASCII     - The standard on which about everything else is based. 
EBCDIC    - Appears in some worlds, but none I have experience of.
ISO 646   - A seven-bit standard, where some characters in ASCII
            are replaced by national characters. Which replacements
            apply depends on the country you're in. For instance, in 
            Sweden left brace is replaced with dotted "a"; in France  
            it's an "e" with accent aigu.
ISO 2022  - A standard that describes how to switch between different
            character sets.
ISO 6937  - An eight-bit set, which doesn't seem to have been adopted 
            very widely. Slots 0-127 are ASCII. From 161 and up are mute 
            (non-spacing) modifiers and national letters. With 6937,
            dotted "a" is produced by first giving a diaeresis and then
            the "a". Virtually all languages with a Latin alphabet,
            except Vietnamese, can be written with this set.
ISO 8859  - Nine standards which all have ASCII in 0-127 and control 
            characters in 128-159. Beyond that, the contents vary depending
            on the geographical area addressed. Five of the sets have Latin 
            characters, then there's one each for Cyrillic, Hebrew, Arabic 
            and Greek. I don't know whether all nine have finally been 
            settled. Some may still be drafts.
              One can expect that by far the most commonly supported
            will be 8859/1, also known as Latin-1. Latin-1 covers most 
            of the languages in Western Europe. (Excluded are Welsh and 
            Catalan.)
              8859 has no mute characters.
ISO 10646 - A multi-octet character set, which is under development. 
            I know very little of it, but I doubt there is even a 
            draft of it. There was a posting about it in comp.std.internat 
            some time ago.
In the following I will concentrate on ISO 646 and 8859. Although I
personally find the ideas in 6937 appealing, its use in real life
is small, so I'm disregarding the problems that supporting this standard
would cause.
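
To make the difference between 6937 and 8859 concrete, here is a minimal
sketch in Python (not Eiffel) of how a "mute" modifier followed by a base
letter yields one accented letter. The function name is hypothetical, and
the two 6937 code values (0xC2 for acute, 0xC8 for diaeresis) are stated
from memory of the standard, so treat them as assumptions. Unicode is
used only as a convenient way to name the resulting letter.

```python
import unicodedata

# Assumed ISO 6937 codes for two non-spacing ("mute") diacritics,
# mapped to the corresponding Unicode combining marks.
NON_SPACING = {
    0xC2: "\u0301",  # acute accent
    0xC8: "\u0308",  # diaeresis
}


def decode_6937_sketch(data):
    """Decode ASCII plus the two modifiers above; anything else is rejected."""
    out = []
    pending = ""  # modifier waiting for its base letter
    for byte in data:
        if byte in NON_SPACING:
            pending = NON_SPACING[byte]
        elif byte < 0x80:
            # The modifier comes first in the byte stream; compose it
            # with the base letter that follows.
            out.append(unicodedata.normalize("NFC", chr(byte) + pending))
            pending = ""
        else:
            raise ValueError("byte %#x not covered by this sketch" % byte)
    return "".join(out)


two_bytes = decode_6937_sketch(bytes([0xC8, 0x61]))  # diaeresis, then "a"
one_byte = bytes([0xE4]).decode("iso8859_1")         # Latin-1 has it precomposed
```

Both values name the same dotted "a", but 6937 needs two bytes where
Latin-1 needs one, which is part of why supporting 6937 is more work.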
            
Operator and delimiter characters
---------------------------------
With an eight-bit set based on ASCII there is rarely any problem. 
However, in a seven-bit world there is. Any programming language using 
any of the characters @[\]^`{|}~ as an operator or a delimiter is 
committing a crime in my eyes. Most of the national sets that ISO 646 
defines replace these characters with national characters, and in many 
cases these characters are letters. So in my eyes a notation like 
   class BIN_TREE [T]
is just as bad as:
   class BIN_TREE ZTQ
(Read Z and Q as opening and closing delimiters!) 
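
What this looks like in practice can be sketched in a few lines of
Python. The table below is my understanding of the six letter
replacements in the Swedish variant of ISO 646 (the brackets, braces,
backslash and vertical bar); the function name is hypothetical and the
table is not taken from any Eiffel documentation.

```python
# Letter replacements in the Swedish national variant of ISO 646,
# as far as I know them: the same seven-bit codes, different glyphs.
SWEDISH_646 = str.maketrans({
    "[": "Ä", "\\": "Ö", "]": "Å",
    "{": "ä", "|": "ö", "}": "å",
})


def as_seen_on_swedish_terminal(source_line):
    """Show an ASCII source line the way a Swedish seven-bit device prints it."""
    return source_line.translate(SWEDISH_646)


header = as_seen_on_swedish_terminal("class BIN_TREE [T]")
braces = as_seen_on_swedish_terminal("{A, B}")
```

Here `header` becomes `class BIN_TREE ÄTÅ`, which is exactly the "letters
as delimiters" effect described above.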
  Many languages that use these characters alleviate the problem
by providing alternative tokens. For instance, Ada allows you to use "!"
for "|". Many Pascal compilers allow "(." and ".)" for [], and (* and
*) are more common than {}.
  Eiffel is a sinner in this field. With Dr. Meyer's origin in mind, 
I assume he is not unaware of the problems, but has chosen to ignore 
them. Still I hope he will re-think and change his mind on this issue. 
Using letters as special characters is simply not a good idea. 
  One could argue that since we're moving into an eight-bit world,
this is a disappearing problem, but remember that the transition
is slow. We will live with seven-bit terminals and printers for quite
a long time still.

Now, what actual problems do we have in Eiffel? The occurrence of brackets 
and braces in Eiffel is restricted to the class declaration and the export 
clause, which gives less pain than if they could occur anywhere. Anyway, 
finding replacement characters should be easy. (To be honest, I don't 
really see why they were chosen in the first place. Is there some 
lexical problem that prevents simple parentheses?)
  Worse is the backslash. Could you think of having to double all "W"s 
in your string literals? Probably not, so you wouldn't pick "W" as 
the escape character. Eiffel has adopted the bad habit from C of using
dotted "O" (which is how the backslash appears on my screen). Here 
I not only want an alternate character, but I also want to get rid 
of the original. (On the whole, I am not fond of the C style of writing 
character and string literals. Why use octal codes?)
                  
Literals
--------
Which characters can I use in string and character literals? If
we forget the fatal backslash, Eiffel doesn't give me any problem 
if I'm using any of the 8859 standards. I can just go ahead and 
use them. (At least that is what its description implies. For what 
happens in real life, see an adjacent article of mine.)
  Other people will have problems, though, mainly Japanese and Chinese  
programmers, since there is no support for multi-byte sets.
  As a side note, a language which really is evil here is Ada. Ada
explicitly forbids non-printing characters in literals, and "non-printing"
is defined in terms of ASCII, so using the upper half of Latin-1 
in Ada is a real pain. Ada-9X will resolve this, but that's another 
three or four years from now. Sigh. 

Identifiers
-----------
Eiffel, like most other languages, allows the letters of the English
alphabet in identifiers. However, if you're writing code in your native 
language, you may need to use other letters as well. To be able to  
use the replacement characters in ISO 646 would be nice, but it would
be pointless to require that.
  But with the eight-bit sets in 8859, it is a fair requirement that all 
letters in these sets are also permitted in identifier names. 
  The problem lies in the difference between the sets. In Latin-1 
161-191 are punctuation characters which you normally wouldn't think of 
in identifier names. 192-255 are letters, with 32 as difference between 
lower and upper case. (There are a few exceptions, which I disregard here 
for brevity.)
  In the other Latin sets, some characters in the range 161-191 are 
also letters, with 16 as the case difference. 
  What the non-Latin sets look like I don't know.
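
The differences between Latin-1 and Latin-2 described above can be
checked with a short Python sketch. It relies on Python's standard
`iso8859_1` and `iso8859_2` codecs and on Python's own notion of lower
case; the function names are hypothetical.

```python
def latin1_case_exceptions():
    """Codes in 192-223 whose lower case is NOT found 32 positions up."""
    exceptions = []
    for code in range(192, 224):
        upper = bytes([code]).decode("iso8859_1")
        shifted = bytes([code + 32]).decode("iso8859_1")
        if upper.lower() != shifted:
            exceptions.append(code)
    return exceptions


def latin2_low_range_pairs():
    """Letters in 161-175 of Latin-2 whose lower case sits 16 positions up."""
    pairs = []
    for code in range(161, 176):
        upper = bytes([code]).decode("iso8859_2")
        shifted = bytes([code + 16]).decode("iso8859_2")
        if upper.isalpha() and upper.lower() == shifted:
            pairs.append((code, upper, shifted))
    return pairs


# The same byte is punctuation in one set and a letter in the other.
byte_161 = bytes([161])
in_latin1 = byte_161.decode("iso8859_1")  # inverted exclamation mark
in_latin2 = byte_161.decode("iso8859_2")  # capital A with ogonek
```

In this sketch the Latin-1 exceptions come out as codes 215 and 223 (the
multiplication sign and the German sharp s), while byte 161 decodes to
punctuation in Latin-1 and to a letter in Latin-2.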
  One could make this a very simple issue and just take Latin-1, on 
the grounds that this is what will be used in the known world of computing.
However, I think this would be a fatal mistake. Should our friends in
Hungary, Russia and Greece be handicapped in the selection of
identifier characters? Do we know that "the known world of computing"
will forever restrict itself to places where Western European 
languages are spoken?
  Now then, how to support multiple sets? One idea would be to have a 
directive that says which character set the source code is written in. 
We must of course immediately discard the idea, since this is impossible 
in a modular language like Eiffel. (What if we want to inherit that 
Latin-2 class in our Latin-1 class?)
  As far as I can see the only way to go is to allow all characters
>= 161 and then use the 32 and 16 differences for case folding. (A
case-sensitive language like C or Modula-2 has some advantage here.) 

Operations         
----------
When comparing two strings, the collating sequence often has little
to do with the alphabet. The only languages I know it works for 
using ISO 646 are English, Danish, Norwegian and maybe Dutch. On the 
whole, one should remember that the character type in this sense is 
not a simple enumeration. In many languages you take accents and
umlauts into account only when no other character is different. And some
languages have pairs of letters that sort as one. (E.g. "ch" in
Spanish, "rz" in Polish.)
  What you need is a set of extended comparison routines, a set of
predefined languages and a set of routines for loading your very
own sorting order. Eiffel is extremely well prepared in this area,
particularly with the addition of infix operators in 2.2. So all
that is needed is some additions to the class library. Of course
I could write them myself, but I think they should be in the 
standard library, since this is the way strings should be compared.
Using the collating sequence is a very artificial way to do it.
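
A minimal sketch of such a "loadable sorting order", written in Python
rather than Eiffel, might look as follows. The name `make_sort_key` and
the two alphabet tables are hypothetical illustrations. The sketch covers
only alphabet order and letter pairs that sort as one; it does not handle
the rule that accents count only as a tie-breaker.

```python
def make_sort_key(alphabet):
    """Build a sort key from an ordered list of collating elements.

    An element may be longer than one character (e.g. "ch").
    """
    rank = {elem: i for i, elem in enumerate(alphabet)}
    longest = max(len(elem) for elem in alphabet)

    def key(word):
        w = word.lower()
        ranks = []
        i = 0
        while i < len(w):
            # Prefer the longest collating element that matches here.
            for size in range(longest, 0, -1):
                chunk = w[i:i + size]
                if len(chunk) == size and chunk in rank:
                    ranks.append(rank[chunk])
                    i += size
                    break
            else:
                # Unknown character: sort it after the alphabet, by code.
                ranks.append(len(alphabet) + ord(w[i]))
                i += 1
        return ranks

    return key


SWEDISH = list("abcdefghijklmnopqrstuvwxyzåäö")
# Traditional Spanish order: "ch" follows "c" and "ll" follows "l".
SPANISH = (list("abc") + ["ch"] + list("defghijkl") + ["ll"]
           + list("mnñopqrstuvwxyz"))

swedish_words = ["öl", "ärm", "zebra", "åka"]
by_code = sorted(swedish_words)                                  # plain code order
by_alphabet = sorted(swedish_words, key=make_sort_key(SWEDISH))  # alphabet order

spanish_words = ["cuna", "chico", "dama", "cosa"]
spanish_sorted = sorted(spanish_words, key=make_sort_key(SPANISH))
```

Sorting by code puts "ärm" before "åka", whereas the Swedish table puts
"åka" first; with the Spanish table "chico" lands after "cuna" because
"ch" is treated as one element.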
  Or are library additions really all we need? If we define a class 
TRANSCRIPTED_STRING which codes a string to some internal format
for comparisons, we would like to write:
    t_str : TRANSCRIPTED_STRING;
    ...
    t_str := "Some string";
But even if our new class is an heir of STRING, the assignment is
not permitted. And defining a TRANSCRIPTED_CHARACTER for single
elements as an heir to CHARACTER is out of the question, since the
latter class is an expanded one and may not be inherited from.
  One solution to these problems would of course be to include
the required operations within STRING and CHARACTER. There is
probably some performance penalty for Americans who don't want
more than simple ASCII comparison, but it's certainly a solution
that looks very appealing.
  It should be added here that there are various operating systems,
not least in the Unix sphere, that support handling of more
than one human language, including run-time support for
comparisons. But they are often intended for C, and Eiffel gives
room for much cleaner interfaces.
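
As an illustration of that kind of run-time support, here is a hedged
sketch using Python's standard `locale` module, which wraps the C
library's `setlocale` and `strxfrm`. The helper name is hypothetical, and
a locale name such as "sv_SE.UTF-8" is an assumption: whether it exists
depends entirely on what the operating system has installed.

```python
import locale


def sort_for_locale(words, locale_name):
    """Sort words by the collation rules of locale_name.

    Returns None if the operating system does not provide that locale.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # strxfrm turns each string into a key that compares in locale order.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


# The "C" locale is always available and compares by plain code order.
plain = sort_for_locale(["b", "c", "a"], "C")
```

A call like `sort_for_locale(words, "sv_SE.UTF-8")` would use the system's
Swedish rules where that locale is installed. Note that the setting is
global to the process, which is one reason a cleaner interface is
attractive.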

In this article I have said very little about multi-byte characters, 
since I have no experience of using them. However, they should not be 
forgotten when addressing these problems. They write programs in Japan 
too.
-- 
Erland Sommarskog - ENEA Data, Stockholm - sommar@enea.se
Bowlers on strike!