[net.nlang] National alphabets and computers

sommar@enea.UUCP (Erland Sommarskog) (09/04/85)

The debate in net.nlang about diacritical marks put its finger on a very
tender spot for the computer (and its peripherals): how to make it fit
all human beings and all their funny languages. In this article
I'd like to develop some of my ideas about what I wish from the machine.

Almost every language has its own alphabet. Every language also
has its own set of diacritical marks. The alphabet is clearly defined;
the set of marks is vaguer because of loans from other languages.
Some examples: 
     (These might not be completely correct; please send me corrections
      by mail and keep them out of the newsgroup.)
    
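     Letters placed under each other are equivalent when sorting.
     A mark is written before its letter: 'E is E with an acute accent,
     "A is A with a trema, oA is A with a ring.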
     
 Swedish:
 
    A B C D E F G H I J K L M N O P Q R S T U V X Y Z oA "A "O
   (`A)    'E                                 W   "U
           ('E)                                 

 English:
 
   A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
           'E
 
  French:
  
   A B C D E F G H I J K L M N O P Q R S T U V X Y Z
   'A  ,C  'E      "I          OE         'U W
   "A      `E                  ^O         "U
           ^E                  "O
           "E
   
  Polish:
  
     A B C 'C CZ D DZ D'Z E F G H I J K L /L M N 'N 
     ,A                   ,E
     
     O P Q R RZ S 'S SZ T U W X Y Z .Z
     'O                     
     
 From this we can recognize two problems. If we want the computer
 to switch between different languages, it must switch its character
 set. This is solved today by making national versions of ASCII, where
 some characters (usually @[\]^`{|}~) are replaced by national
 characters. This solution throws light on the other problem:
 sorting. (And by that I mean not only programs called SORT,
 but also programming-language concepts such as comparison and functions
 like pred and succ.)
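
 The sorting problem with a national ASCII can be shown in a few lines.
 This is only a sketch, assuming the common Swedish 7-bit assignment
 in which [ \ ] { | } are displayed as Ä Ö Å ä ö å. The names
 SWEDISH_7BIT and show_swedish are made up for the illustration.

```python
# Assumed Swedish national 7-bit assignment: six ASCII positions
# carry national letters instead of brackets and braces.
SWEDISH_7BIT = {
    "[": "Ä", "\\": "Ö", "]": "Å",
    "{": "ä", "|": "ö", "}": "å",
}

def show_swedish(raw: str) -> str:
    """Render 7-bit text the way a Swedish terminal would display it."""
    return "".join(SWEDISH_7BIT.get(ch, ch) for ch in raw)

# Three words as stored in 7 bits; displayed they read Ål, Älg, Öl.
stored = ["]l", "[lg", "\\l"]

# Plain comparison of character codes, which is what SORT, "<" and
# pred/succ all amount to.
by_code = sorted(stored)
print([show_swedish(w) for w in by_code])   # ['Älg', 'Öl', 'Ål']

# The same codes mean something else to a programmer:
print(show_swedish("a[i]"))                 # aÄiÅ
```

 The Swedish alphabet ends in Å Ä Ö, so the code order Ä Ö Å puts Ål
 last where it should come first of the three.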
 
 When we change between languages it would be very nice if the
 computer followed us. (Of course we would have to tell it.) Two things
 should be satisfied:
 1) Sorting on the national alphabet
 2) Ignoring diacritical marks when sorting, yet representing
    them when the letters are displayed.
 With today's concept of ASCII, this is a tough game to carry out.
 If two characters look different, they also have different values. And
 as we can see from above, what is a separate letter in one language
 is just a variant in another (W, for instance).
 Some diacritical marks have a general meaning (like the trema (") in
 French); some are bound to a certain letter (like the cedilla to C in
 French).
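
 The two wishes above can be sketched as a sort key that is chosen per
 language. This is an illustration only: the alphabets are copied from
 the tables earlier in the article (which may not be completely
 correct), the letters are written as real accented characters instead
 of the 'E notation, and make_key is a hypothetical helper.

```python
def make_key(alphabet: str):
    """Build a sort key from an alphabet given as space-separated groups.

    Letters in the same group are equivalent when sorting; the order of
    the groups is the order of the national alphabet.
    """
    groups = alphabet.split()
    rank = {ch: i for i, group in enumerate(groups) for ch in group}
    unknown = len(groups)          # assumption: unknown characters sort last

    def key(word: str):
        return tuple(rank.get(ch, unknown) for ch in word.upper())
    return key

# Swedish: W is a variant of V, Ü of Y, É of E; Å Ä Ö are letters after Z.
swedish = make_key("AÀ B C D EÉ F G H I J K L M N O P Q R S T U VW X YÜ Z Å Ä Ö")
# English: W is a letter of its own.
english = make_key("A B C D EÉ F G H I J K L M N O P Q R S T U V W X Y Z")

words = ["Öl", "Zorn", "Vik", "Älg", "Wall", "Ås"]
print(sorted(words, key=swedish))            # ['Wall', 'Vik', 'Zorn', 'Ås', 'Älg', 'Öl']
print(sorted(["Wall", "Vik"], key=english))  # ['Vik', 'Wall']
```

 The key is used only for comparing, so the words come back with
 their marks intact for display.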
  
 On a typewriter this is often done with keys for accents and the like
 that are mute, i.e. the paper doesn't move when you use them. This is of
 course not a problem with the computer, but with the terminal. A non-graphic
 terminal is seldom able to display two characters in the same position.
 
 Well, a lot more could be said. One quick conclusion we can draw from
 this is that a character simply can't be represented by a single byte.
 This would cost efficiency, that's true. Yet I think today's situation
 is quite unsatisfying, and I think that many of you agree.
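
 One possible reading of "more than a single byte" is to store every
 character as a pair: the letter and its mark. The sketch below is an
 assumption, not a proposal from any standard. Char, parse, encode and
 base_key are hypothetical names, and only the marks ' ` " ^ from the
 tables above are handled.

```python
from typing import List, NamedTuple

MARKS = "'`\"^"    # marks in the article's notation, written before the letter

class Char(NamedTuple):
    base: str       # the letter that counts when sorting
    mark: str = ""  # the mark that only matters when displaying

def parse(text: str) -> List[Char]:
    """Turn the article's notation, e.g. 'ET'E, into (base, mark) pairs."""
    chars, mark = [], ""
    for ch in text:
        if ch in MARKS:
            mark = ch
        else:
            chars.append(Char(ch, mark))
            mark = ""
    return chars

def encode(chars: List[Char]) -> bytes:
    """Two bytes per character: the letter, then its mark (0 if none)."""
    out = bytearray()
    for c in chars:
        out.append(ord(c.base))
        out.append(ord(c.mark) if c.mark else 0)
    return bytes(out)

def base_key(word: str) -> List[str]:
    """Sort key that looks at the letters only and ignores the marks."""
    return [c.base for c in parse(word)]

words = ["'ET'E", "ETAPE", "'ETAT"]
print(sorted(words))                  # by code: ["'ET'E", "'ETAT", 'ETAPE']
print(sorted(words, key=base_key))    # by letter: ['ETAPE', "'ETAT", "'ET'E"]
print(len(encode(parse("'ET'E"))))    # 6 bytes for a three-letter word
```

 The cost in efficiency is visible at once: the three-letter word
 takes six bytes, and every comparison has to step over the marks.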
 
 What the future might bring I don't know. From this it is quite clear
 that a 256-character ASCII set wouldn't solve the problem, but of course
 it would be an improvement.
 
 To conclude: Computers are very intelligent, still
              they are like idiots because they can't talk with people.