sommar@enea.UUCP (Erland Sommarskog) (09/04/85)
The debate in net.nlang about diacritical marks put its finger on a very
tender spot for the computer (and its peripherals): how to make it fit
all human beings and all their funny languages. With this article
I'd like to develop some of my ideas about what I wish for from the machine.
Almost every language has its own alphabet, and every language also
has its own set of diacritical marks. The alphabet is clearly defined;
the set of marks is vaguer because of loans from other languages.
Some examples:
(These might not be completely correct; please send me corrections
by mail and keep them out of the newsgroup.)
Letters that are placed under each other are equivalent when sorting.
Swedish:
  A B C D E F G H I J K L M N O P Q R S T U V X Y Z oA "A "O
(`A)     'E                                 W  "U
        ('E)
English:
  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
         'E
French:
  A B C D E F G H I J K L M N O P Q R S T U V X Y Z
 'A  ,C  'E      "I           OE         'U W
 "A      `E                  ^O          "U
         ^E                  "O
         "E
Polish:
  A B C 'C CZ D DZ D'Z E F G H I J K L /L M N 'N
 ,A                   ,E
  O P Q R RZ S 'S SZ T U W X Y Z .Z
 'O
From this we can recognize two problems. If we want the computer
to switch between different languages, it must switch its character
set. This is solved today by making national ASCIIs, where
some characters (usually @[\]^`{|}~) are replaced by national
characters. This solution throws light on the other problem:
sorting. (And by that I do not only mean programs called SORT,
but also programming-language concepts such as comparison and functions
like pred and succ.)
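A minimal sketch of how the two problems meet, in Python, assuming the
Swedish national variant of 7-bit ASCII in which the codes for [ \ ] { | }
are shown as the letters "A "O oA and their lower-case forms. The table and
function names are made up for illustration, and modern Unicode strings
stand in for what the terminal would show.

```python
# Assumed Swedish national ASCII: six ASCII codes are shown as national letters.
SWEDISH_GLYPHS = {
    0x5B: "Ä", 0x5C: "Ö", 0x5D: "Å",   # [ \ ]
    0x7B: "ä", 0x7C: "ö", 0x7D: "å",   # { | }
}

def display(codes: bytes) -> str:
    """Render 7-bit codes the way a Swedish terminal would show them."""
    return "".join(SWEDISH_GLYPHS.get(c, chr(c)) for c in codes)

# The same three codes read as "]", "[", "\" on a US terminal.
letters = bytes([0x5D, 0x5B, 0x5C])   # the last three letters, in alphabet order
by_code = bytes(sorted(letters))      # what a plain comparison of codes gives

print(display(letters))   # order of the Swedish alphabet
print(display(by_code))   # order of the character codes
```

The two printed lines differ: comparing codes puts oA after "O, although
it comes before "A in the alphabet.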
When we change between languages, it would be very nice if the
computer followed us. (Of course we would have to tell it.) Two things
should be satisfied:
1) Sorting on the national alphabet
2) Ignoring diacritical marks when sorting, yet representing
them when the letters are displayed.
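The two requirements could be sketched like this in Python. The table is
an assumption that follows the Swedish example earlier in this article
(W filed with V, accented E with E, "U with Y); SWEDISH and make_sort_key
are hypothetical names, not an existing facility.

```python
# Each entry is one position in the alphabet; letters in the same
# entry are equivalent when sorting. Assumed Swedish table.
SWEDISH = [
    "AÀ", "B", "C", "D", "EÉ", "F", "G", "H", "I", "J", "K", "L", "M",
    "N", "O", "P", "Q", "R", "S", "T", "U", "VW", "X", "YÜ", "Z",
    "Å", "Ä", "Ö",
]

def make_sort_key(alphabet):
    """Build a key function that ranks letters by their alphabet position."""
    rank = {ch: pos for pos, group in enumerate(alphabet) for ch in group}
    unknown = len(alphabet)  # anything outside the alphabet sorts last

    def key(word):
        return [rank.get(ch, unknown) for ch in word.upper()]
    return key

words = ["Örn", "Wall", "Åsa", "Vik", "Zorn", "Ärta", "Véra", "Vera"]
print(sorted(words, key=make_sort_key(SWEDISH)))
```

The words themselves are never changed, so the marks are still there when
they are printed; only the key used for comparison ignores them. Telling
the computer that we have switched language would mean handing it another
table.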
With today's concept of ASCII, this is a tough game to carry out.
If two characters look different, they also have different values. And
as we can see from above, what is a separate letter in one language
is just a variant in another... (W, for instance.)
Some diacritical marks have a general meaning (like the trema (") in
French); some are bound to a certain letter (like the cedilla to C in
French).
On a typewriter this is often handled by making the keys for accents and
similar marks dead, i.e. the paper doesn't move when you use them. This
is of course not a problem for the computer, but for the terminal. A
non-graphic terminal is seldom able to display two characters in the
same position.
Well, a lot more could be said. One quick conclusion we can draw from
this is that a character simply can't be represented by a single byte.
This would cost efficiency, that's true. Yet I think today's situation
is quite unsatisfying, and I think that many of you agree.
What the future might bring I don't know. From this it is quite clear
that a 256-character ASCII set wouldn't solve the problem, but of course
it would be an improvement.
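One way to picture a character that takes more than one byte is a pair of
base letter and diacritical mark, written here in the same notation as the
alphabet tables earlier in this article ('E for E with an acute accent).
This is only a sketch in Python under that assumption; Char, parse and
sort_key are invented names. Comparison looks at the base letter alone,
while display keeps the mark.

```python
from dataclasses import dataclass

MARKS = "'`^\",./"   # mark symbols used as prefixes in this article's notation

@dataclass(frozen=True)
class Char:
    base: str        # the letter used for sorting
    mark: str = ""   # the diacritical mark, kept only for display

    def __str__(self):
        return self.mark + self.base

def parse(text):
    """Turn prefix notation such as "'ETUDE" into a list of Char."""
    chars, mark = [], ""
    for c in text:
        if c in MARKS and not mark:
            mark = c               # remember the mark for the next letter
        else:
            chars.append(Char(c, mark))
            mark = ""
    return chars

def sort_key(chars):
    return [ch.base for ch in chars]   # marks are ignored when sorting

words = [parse(w) for w in ["ZONE", "'ETUDE", "ETAGE", "FA,CADE"]]
for w in sorted(words, key=sort_key):
    print("".join(str(ch) for ch in w))
```

Each Char costs two fields instead of one byte, which is the loss of
efficiency mentioned above. A language where a marked letter is a letter
of its own would still need an alphabet table on top of this.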
To conclude: Computers are very intelligent, still
they are like idiots because they can't talk with people.