sommar@enea.UUCP (Erland Sommarskog) (09/04/85)
The debate in net.nlang about diacritical marks put the finger on a very tender spot for the computer (and its peripherals): how to make it fit all human beings and all their funny languages. With this article I'd like to develop some of my ideas of what I wish from the machine.

Almost every language has its unique alphabet. Also, every language has its unique set of diacritical marks. The alphabet is clearly defined; the set of marks is more vague due to loans from other languages. Some examples: (These might not be completely correct. Send me corrections by mail and leave them out of the newsgroup, please.) Letters that are put under each other are equivalent when sorting.

Swedish: A B C D E F G H I J K L M N O P Q R S T U V X Y Z oA "A "O (`A) 'E W "U ('E)

English: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 'E

French: A B C D E F G H I J K L M N O P Q R S T U V X Y Z 'A ,C 'E "I OE 'U W "A `E ^O "U ^E "O "E

Polish: A B C 'C CZ D DZ D'Z E F G H I J K L /L M N 'N ,A ,E O P Q R RZ S 'S SZ T U W X Y Z .Z 'O

From this we can recognize two problems. If we want the computer to switch between different languages, it must switch its character set. This is solved today by making national variants of ASCII, where some characters (usually @[\]^`{|}~) are replaced by national characters. This solution throws light on the other problem: sorting. (And by that I do not only mean programs called SORT, but also programming-language concepts such as comparison, and functions like pred and succ.)

When we change between languages it would be very nice if the computer followed us. (Of course we would have to tell it.) Two things should be satisfied:

1) Sorting on the national alphabet.
2) Ignoring diacritical marks when sorting, yet representing them when the letters are displayed.

With today's concept, ASCII, this is a tough game to carry out. If two characters look different, they also have different values.
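To make the two requirements concrete, here is a minimal sketch in Python of a sort that is told which language to follow. The table names and their contents are my own illustrative assumptions, not complete descriptions of either language, and real characters are used instead of the two-character notation above. Each table lists the national alphabet in order and maps variants to the letter they are equivalent with; only the sort key is folded, so the words themselves keep their marks for display.

```python
# Hypothetical per-language tables (illustrative assumptions, not complete).
# "alphabet" gives the national sort order; "variants" maps a character to
# the letter it is treated as when sorting.
SWEDISH = {
    "alphabet": "ABCDEFGHIJKLMNOPQRSTUVXYZÅÄÖ",
    "variants": {"W": "V", "É": "E", "À": "A", "Ü": "Y"},
}
ENGLISH = {
    "alphabet": "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "variants": {"É": "E", "Å": "A", "Ä": "A", "Ö": "O"},
}


def sort_key(word, table):
    """Turn a word into a list of ranks in the given national alphabet."""
    ranks = {letter: i for i, letter in enumerate(table["alphabet"])}
    unknown = len(ranks)  # characters outside the table sort last
    return [
        ranks.get(table["variants"].get(ch, ch), unknown)
        for ch in word.upper()
    ]


def national_sort(words, table):
    """Sort by the folded key; the words themselves are left untouched."""
    return sorted(words, key=lambda w: sort_key(w, table))


words = ["zon", "ärm", "wok", "vas", "öra", "ål"]
print(national_sort(words, SWEDISH))  # ['vas', 'wok', 'zon', 'ål', 'ärm', 'öra']
print(national_sort(words, ENGLISH))  # ['ål', 'ärm', 'öra', 'vas', 'wok', 'zon']
```

The same six words come out in two different orders depending on which table the program is handed, and W lands next to V under the Swedish table only. This treats every character as one unit, so it does not cover digraph letters such as the Polish CZ or SZ.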
And as we can see from above, what is a separate letter in one language is just a variant in another... (W, for instance.)

Some diacritical marks have a general meaning (like the trema (") in French), some are bound to a certain letter... (Like the cedilla to C in French.) On a typewriter this is often handled by making the keys for accents and similar marks mute, i.e. the paper won't move when you use them. This is of course not a problem with the computer, but with the terminal. A non-graphic terminal is seldom able to display two characters in the same position.

Well, a lot more could be said. One quick conclusion we can draw from this is that a character just can't be represented by a single byte. This would cost in efficiency, that's true. Yet I think the situation of today is quite unsatisfying, and I think that many of you agree. What the future might bring I don't know. From this it is quite clear that a 256-character ASCII set wouldn't solve the problem, but of course it would mean an improvement.

To conclude: Computers are very intelligent, still they are like idiots because they can't talk with people.