sommar@enea.UUCP (10/31/87)
<Note that this is cross-posted. Please submit follow-ups to comp.std.internat only.>

Some months ago I asked for a new character concept in which the result of string comparisons depends on the selected language. I stated that the old ASCII concept, one character <=> one collation value, must go away.

I have completed a string-comparison package that performs much of what I asked for. (I still think that the OS and the compiler should do this for me.) I have posted the package itself to comp.sources.misc, and I have included parts of the READ ME file below.

Besides the package itself, the posting contains definition files for names of non-ASCII characters in ISO/Latin-1 and national 7-bit ASCII replacements. There is also a little demonstration program, which reads lines from standard input and writes them sorted to standard output.

The package is written in Ada, so if you don't have an Ada compiler it may be of little practical use; the main purpose, however, is to demonstrate the idea itself.

----------------------------------------------------------------------
Four-dimensional sorting
------------------------
The string-comparison package compares strings at four levels:
  1) Alphabetic
  2) Accents
  3) Non-letters
  4) Difference in case

What counts as an alphabetic, an accent and so on is up to the user. He may define "$" as a letter with "(" as its lowercase variant if he likes. A level is only considered if the levels above show no difference.

As an example I take
    T^ete-`a-t^ete
(I assume a "normal" loading of the character table here.) For the first level we use
    TETEATETE
thus we remove the accents and the hyphens. On the next level we re-insert the accents, so we get
    T^ETE`AT^ETE
On level three we only take the hyphens into account. When comparing non-letters the package uses the simple ASCII values. The earlier a character comes in the string, the lower its sort value. Thus, "trans-scription" will precede "transscrip-tion".
(Actually, as the implementation is done, the position is more important than the ASCII value.)

On the last level we use
    T^ete`at^ete
thus, the original writing with the hyphens removed. Note that the user can specify case to be insignificant.

(This isn't a description of how the package is implemented, just a way of illustrating the result. In practice it's done a little more efficiently.)

When defining accented variants it is possible to let a character be a variant of a string; in this way the AE ligature can be sorted as "AE". The opposite is not possible and, what is worse, a string can't have an alphabetic value. Thus the package is not able to sort languages such as Spanish (CH and LL) correctly.

The digits are handled in a special way if you define them as alphabetics. A sequence of digits is read as one number and sorts after all other alphabetics (even if the digits were defined as the first characters). So you will get
    File1 File2 File10 File11
instead of the usual
    File1 File10 File11 File2
If you would rather sort them as they are read, this is also possible, e.g. load "0" as a variant of "zero".

The package contains the following routines:

Load Operations
---------------
  PROCEDURE Load_alphabetic(ch : IN character);

Loads ch as the next alphabetic character. The order of loading determines the sorting values.

  PROCEDURE Load_variant(ch       : IN character;
                         Equ_ch   : IN character;
                         Equ_kind : IN Equivalence_kind);
  TYPE Equivalence_kind IS (Exact, Case_diff, Accented);
  PROCEDURE Load_variant(ch      : IN character;
                         Equ_str : IN string);

Load_variant loads ch as a variant of Equ_ch or Equ_str. The interpretation of Equ_kind is:
  Exact:     Exactly the same. There is no difference. This is what you use when you don't want case to be significant.
  Case_diff: Load ch as a lowercase variant of Equ_ch. There will be a difference at level 4.
  Accented:  Load ch as a variant of Equ_ch at level 2.
The latter version of Load_variant always loads ch at level 2.
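
As an illustration of the four levels and the loading rules described above, here is a minimal sketch in Python. It is not the Ada package and does not claim to mirror its implementation. The class Collation, its methods and the helper demo_table are hypothetical names made up for this example. Three choices are assumptions of the sketch, not statements from the READ ME: uppercase sorts before lowercase at level 4, every accented variant receives its own level-2 value in load order, and digit runs are always read as numbers (the package does so only when digits are loaded as alphabetics).

```python
DIGITS = "0123456789"


class Collation:
    """Toy model of a four-level collation table (hypothetical, not the Ada package)."""

    def __init__(self):
        self.rank = {}    # alphabetic character -> level-1 sort value
        self.table = {}   # character -> (base letters, accent id, is_lowercase)
        self.accents = 0  # hands out level-2 values in load order (assumption)

    def load_alphabetic(self, ch):
        # Order of loading determines the level-1 sort value.
        self.rank[ch] = len(self.rank)
        self.table[ch] = (ch, 0, False)

    def load_variant(self, ch, equ, kind="accented"):
        # equ may be one character or a string (e.g. a ligature sorted as "AE").
        base = "".join(self.table[c][0] for c in equ)
        accent, lower = self.table[equ[0]][1], self.table[equ[0]][2]
        if kind == "case":
            lower = True                  # differs from equ only at level 4
        elif kind == "accented":
            self.accents += 1
            accent = self.accents         # differs from equ at level 2
        self.table[ch] = (base, accent, lower)  # kind == "exact": plain copy

    def key(self, text):
        alpha, accent, other, case = [], [], [], []
        i = 0
        while i < len(text):
            ch = text[i]
            if ch in DIGITS:
                # A run of digits is one number sorting after all alphabetics.
                j = i
                while j < len(text) and text[j] in DIGITS:
                    j += 1
                alpha.append((len(self.rank), int(text[i:j])))
                accent.append(0)
                case.append(0)
                i = j
                continue
            if ch in self.table:
                base, acc, lower = self.table[ch]
                for n, b in enumerate(base):
                    alpha.append((self.rank[b], 0))
                    accent.append(acc if n == 0 else 0)
                    case.append(int(lower))
            else:
                # Non-letter: position counts first, then the character code.
                other.append((len(alpha), ord(ch)))
            i += 1
        # Levels 1 to 4; a later level matters only if the earlier ones tie.
        return (alpha, accent, other, case)


def demo_table():
    col = Collation()
    col.load_alphabetic(" ")              # space as the first alphabetic
    for up in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
        col.load_alphabetic(up)
        col.load_variant(up.lower(), up, "case")
    col.load_variant("ê", "e")
    col.load_variant("à", "a")
    col.load_variant("Æ", "AE")
    return col


if __name__ == "__main__":
    col = demo_table()
    print(sorted(["File10", "File2", "File11", "File1"], key=col.key))
    print(sorted(["transscrip-tion", "trans-scription"], key=col.key))
```

Building the whole key up front and comparing the four parts in order gives the same result as comparing level by level, which is all the sketch is meant to show.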

To simplify loading, the package also provides routines for loading a character and its ASCII lowercase equivalent simultaneously:

  PROCEDURE Set_case_significance(Flag : boolean);
  PROCEDURE Alpha_both_cases(ch : IN character);
  PROCEDURE Variant_both_cases(ch     : IN character;
                               Equ_ch : IN character);
  PROCEDURE Variant_both_cases(ch      : IN character;
                               Equ_str : IN string);

With Set_case_significance you determine whether case should be significant when loading the pairs. Variant_both_cases loads ch at level 2.

The loading operations raise Already_defined if an attempt is made to load a character twice. If Equ_ch or part of Equ_str is undefined, the exception Undefined_equivalent is raised.

Transscription operations
-------------------------
These routines translate a string to the internal coding.

  TYPE Transscripted_string(Max_length : natural) IS PRIVATE;
  PROCEDURE Transscribe(ch        : IN character;
                        Trans_str : OUT Transscripted_string);
  PROCEDURE Transscribe(Str       : IN string;
                        Trans_str : OUT Transscripted_string);

If the transscription is too long, the routines will raise Transscription_error.

Comparison operators
--------------------
  FUNCTION "<=" (Left, Right : Transscripted_string) RETURN boolean;
  FUNCTION "<"  (Left, Right : Transscripted_string) RETURN boolean;
  FUNCTION ">=" (Left, Right : Transscripted_string) RETURN boolean;
  FUNCTION ">"  (Left, Right : Transscripted_string) RETURN boolean;

I have only included operations for comparing transscripted strings. Of course there could be a set for uncoded strings too.

Other function
--------------
  FUNCTION Is_letter(ch : character) RETURN boolean;

The demonstration program
-------------------------
The program takes the options:
  -8    Use ISO/Latin-1. If not present, use 7-bit ASCII with national replacements.
  -e    Case is significant. When omitted, case is not significant.
  -LX   Selects language. X should be one of the following:
          s or S: Swedish. (Default)
          d or D: Danish
          g: German1: "A, "O and "U sort as A, O and U.
          G: German2: "A, "O and "U sort as AE, OE and UE.
          f or F: French

In the definition routine I load space as the first alphabetic letter. This gives the result that "Smith, Tony" will sort before "Smithson, Alan".

-- 
Erland Sommarskog
ENEA Data, Stockholm
sommar@enea.UUCP
It could have been worse; it could have been Pepsi.