rcd@opus.UUCP (Dick Dunn) (11/04/85)
> >>Unfortunately, you CAN'T build a good international character set.
> >>Some of those silly European countries have the same character in
> >>several languages, but sort the character in different places in each
> >>language. They also have interesting constructs like characters that
> >>sort as two characters, and pairs of characters that sort as single
> >>characters...

This view mixes several different matters:
	- getting an international character set
	- dealing with multiple-letter groupings and ligatures
	- sorting and searching
	- collating

To get a useful international character set, all you need is a mapping of
enough codes to enough different symbols that you can prepare text for all
the languages of interest. This doesn't prejudice how many bits you need,
nor whether some symbols may require more than one datum (byte) for
representation. (Example: Letters with diacritical marks might be one byte
for a special code, two to combine a non-escaping diacritical with a
letter, or three to combine a letter, a backspace, and a diacritical. The
only real constraint is that in the third case you should decide whether
the letter or the mark comes first in the sequence.)

Multiple letters are probably represented as multiple data for the actual
sequence of letters. Ligatures are probably represented with single codes
since they are distinct from the letters they comprise. (I am referring to
ligatures in the linguistic sense, such as AE, rather than in the
typographic sense, such as ffl.)

Sorting, used in the sense of putting things in -some- order, for the
purpose of being able to search or compare, is easy--note that you DON'T
need to have any "sensible" order for this; it must only be consistent.
You could process American text and order the alphabet A X C B D E Y... as
long as you do so consistently.
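
To make the point about consistency concrete, here is a minimal Python
sketch. The alphabet order used (keyboard rows) is a hypothetical choice of
mine, not the order hinted at above, and the names ORDER, key, build_index
and contains are illustrative only. It assumes words made of uppercase A-Z.
The index is sorted in a "senseless" order, yet binary search still finds
entries because the same key is used to build and to probe.

```python
import bisect

# Hypothetical, deliberately "senseless" alphabet order (keyboard rows).
ORDER = "QWERTYUIOPASDFGHJKLZXCVBNM"
RANK = {ch: i for i, ch in enumerate(ORDER)}

def key(word):
    """Map a word to a tuple of ranks under ORDER (uppercase A-Z only)."""
    return tuple(RANK[ch] for ch in word)

def build_index(words):
    """Sort by the arbitrary key; return parallel lists of keys and words."""
    pairs = sorted((key(w), w) for w in words)
    return [k for k, _ in pairs], [w for _, w in pairs]

def contains(keys, word):
    """Binary search; valid because the index was built with the same key."""
    k = key(word)
    i = bisect.bisect_left(keys, k)
    return i < len(keys) and keys[i] == k

words = ["DOG", "CAT", "BIRD", "ANT", "ZEBRA"]
keys, ordered = build_index(words)
print(ordered)                      # order looks odd to a human reader
print(contains(keys, "CAT"))        # lookup of a present word
print(contains(keys, "COW"))        # lookup of an absent word
```

The resulting list is useless for a person scanning it by eye, which is
exactly the distinction drawn next.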

This makes more sense if I compare it with:

Collating, used in the sense of putting things in order so that humans will
agree that they are in order and can use the ordering, is quite a difficult
problem. It cannot be done based on the character codes. Look at the
ordering that ASCII would produce. If you think it's reasonable, consider
that MacDonald will sort between MacAfee and Macafee. ASCII only works
somewhat reasonably for simplistic use with American or English text; even
then, it cannot deal reasonably with words of foreign origin. (Suppose
that instead of canyon you have can~on (assume it were represented as it
should be) - what sequence of bytes would represent it, and where would
that sequence sort to, assuming the sort program didn't go weird on you?)

What you have to do for humans is to sort according to a set of rules
designed for the natural language you are processing. You have to know
this at the outset, and the rules for most languages are not going to be
trivial. It doesn't have to influence the character set design all that
much.
--
Dick Dunn	{hao,ucbvax,allegra}!nbires!rcd		(303)444-5710 x3086
   ...Never attribute to malice what can be adequately explained by stupidity.
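
The MacDonald example above can be reproduced with a short Python sketch.
The function collation_key is a hypothetical toy rule (case-insensitive
comparison, ties broken by code order), offered only to show the key
living outside the character codes; real language rules are far more
involved than this.

```python
names = ["Macafee", "MacDonald", "MacAfee"]

# Plain code order: every uppercase ASCII letter precedes every lowercase one,
# so "MacDonald" lands between the two spellings of the other name.
by_code = sorted(names)

def collation_key(name):
    """Toy rule: compare ignoring case first, then break ties by code order."""
    return (name.casefold(), name)

# Same character codes, different ordering rule applied on top of them.
by_rule = sorted(names, key=collation_key)

print(by_code)  # ['MacAfee', 'MacDonald', 'Macafee']
print(by_rule)  # ['MacAfee', 'Macafee', 'MacDonald']
```

The character codes are identical in both runs; only the rule applied on
top of them changes, which is why the rule need not drive the character
set design.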