allbery@ncoast.UUCP (11/02/87)
This is to be posted in comp.sources.misc. Thank you. This posting contains a package for string-comparisons in fairly sophisticated way where reagrd to accents, non-letters and case is only taken if necessary. The user defines how each character should be sorted, whether if it is a letter, and whether it is a variant of another letter. See READ ME for complete description. The purpose is not to provide a facility, though, rather to demonstrate the idea. The code is in Ada, so you may have problem with using it straight off, if you don't have access to a compiler. Still you can take advantage of the ideas if you like. Comments and questions are welcome to: Erland Sommarskog ENEA Data, Stockholm sommar@enea ---------------------------------------------------------------------- #! /bin/sh # This is a shell archive, meaning: # 1. Remove everything above the #! /bin/sh line. # 2. Save the resulting text in a file. # 3. Execute the file with /bin/sh (not csh) to create: # READ ME # comline.a # define.a # latin1.a # main.a # natascii.a # strcompb.a # strcomps.a # This archive created: Fri Oct 30 23:01:51 1987 export PATH; PATH=/bin:/usr/bin:$PATH if test -f 'READ ME' then echo shar: "will not over-write existing file 'READ ME'" else cat << \SHAR_EOF > 'READ ME' The intention of this posting is not to provide a facility, but rather to demonstrate a technique to do string comparisons in a more sophisticated way than simply using ASCII values. Comments, questions etc are very welcome to: Erland Sommarskog ENEA Data, Stockholm sommar@enea.UUCP The posting contains seven files that can be divided into three groups: I: strcompS.a and strcompB.a The core of the posting. They contain a package for string comparisons. It has a character-transscription table to be loaded by the user and comparison operators for trans- scripted string. The exported routines are described below. StrcompS is the specification, whereas strcompB contains the package body. II: latin1.a and natascii.a They declare names for characters, to be used, for example, when defining a collating sequence for the package above. Latin1 declares names for the ISO standard 8859/1. Natascii declares names for national replacements of the ordinary ASCII set. III: define.a, comline.a and main.a An demonstration application that uses the string-comparison package. Define.a loads the character collating sequence. Comline.a reads the command line. Note that this file is bound to Verdix Ada for Unix and must be rewritten for another system. Main.a is the main program. It reads lines from standard input or a named file and writes the sorted lines to standard output when end-of-file is detected. You find a description of the options last in this file. You should compile the files in the order: latin1, natascii, strcompS, strcompB, define, comline, main. Four-dimensional sorting ------------------------ The string-comparison package compares strings at four levels: 1) Alphabetic 2) Accents 3) Non-letters 4) Difference in case What is an alphabetic etc is up to the user. He may define "$" being a letter with "(" as its lowercase variant if he likes. One level is only regarded if the level above have no difference. As an example I take T^ete-`a-t^ete (I assume a "normal" loading of the character table here.) For the first level we use TETEATETE, thus we remove the accents and the hyphens. On the next we re-insert the accents so we get T^ETE`AT^ETE On level three we only take the hyphens in regard. When comparing non-letters the package uses the simple ASCII values. The earlier a character comes, the lower is the sort value. Thus, "trans-scription" will precede "transscrip-tion". (Actually, as the implementation is done, the position is more important than the ASCII value.) On the last level we use T^ete`at^ete thus, the original writing with the hyphens removed. Note that the user can specify case to be insigificant. (This isn't a description on how the package is implemented, just a way of illustrating the result. In practice it's done a little more effective.) When defining accented variants it is possible to let a character be a variant of a string, in this way the AE ligature can be sorted as "AE". The opposite is not possible, and what worse is, a string can't have an alphabetic value. Thus the package is not able to sort languages as Spanish (CH and LL) correctly. The number characters are handled in a special way if you define them as alphabetics. A sequence of figures will read as one number and sort after all other alphabetics. (Even if they were defined as the first characters.) So you will get File1 File2 File10 File11 instead of the usual File1 File10 File11 File2 If you like to sort them as they are read, this is also possible. E.g. load "0" as a variant of "zero". The package contains the following routines: Load Operations --------------- PROCEDURE Load_alphabetic(ch : IN character); Loads ch as the next alphabetic character. The order of loading determines the sorting values. PROCEDURE Load_variant(ch : IN character; Equ_ch : IN character; Equ_kind : IN Equivalence_kind); TYPE Equivalence_kind IS (Exact, Case_diff, Accented); PROCEDURE Load_variant(ch : IN character; Equ_str : IN string); Load_variant loads ch as a variant of Equ_ch or Equ_str. The interpretation of Equ_kind is: Exact: Exactly the same. There is no difference. What you use when you don't want case to be significant. Case_diff: Load ch as a lowercase variant of Equ_ch. There will be difference at level 4. Accented: Load ch as variant of Equ_ch at level 2. The latter version of Load_variant always loads ch at level 2. For simplify loading, the package also provides routines for loading a character and its ASCII lowercase equivalent simultaneously: PROCEDURE Set_case_significance(Flag : boolean); PROCEDURE Alpha_both_cases(ch : IN character); PROCEDURE Variant_both_cases(ch : IN character; Equ_ch : IN character); PROCEDURE Variant_both_cases(ch : IN character; Equ_str : IN string); With Set_case_significant you determine whether case should be significant when loading the pairs. Variant_both_cases loads ch at level 2. The loading operations raise Already_defined if an attempt is made to load a character twice. If Equ_ch or part of Equ_str is undefined, this gives the exception Undefined_equivalent. Transscription operations ------------------------- These routines translates a string to the internal coding. TYPE Transscripted_string(Max_length : natural) IS PRIVATE; PROCEDURE Transscribe(ch : IN character; Trans_str : OUT Transscripted_string); PROCEDURE Transscribe(Str : IN string; Trans_str : OUT Transscripted_string); If the transscription is too long, the routines will raise Transscription_error. Comparison operators: --------------------- FUNCTION "<=" (Left, Right : Transscripted_string) RETURN boolean; FUNCTION "<" (Left, Right : Transscripted_string) RETURN boolean; FUNCTION ">=" (Left, Right : Transscripted_string) RETURN boolean; FUNCTION ">" (Left, Right : Transscripted_string) RETURN boolean; I have only included operations for comparing transscripted strings. Of course there could be a set for uncoded strings too. Other function -------------- FUNCTION Is_letter(ch : character) RETURN boolean; The demonstration program ------------------------- The program takes the options: -8 Use ISO/Latin-1. If not present, use 7-bit ASCII with national replacements. -e Case is significant. When omitted, case is not significant. -LX Selects language. X should be one of the following: s or S: Swedish. (Default) d or D: Danish g: German1: "A, "O and "U sorts as A, O and U. G: German2: "A, "O and "U sorts as AE, OE and UE. f or F French In the definition routine I load space as the first alphabetic letter. This gives the result that "Smith, Tony" will sort before "Smithson, Alan". SHAR_EOF fi if test -f 'comline.a' then echo shar: "will not over-write existing file 'comline.a'" else cat << \SHAR_EOF > 'comline.a' ---------------------------------------------------------------------- -- PROCEDURE Read_command_line -- ---------------------------------------------------------------------- -- This procedure reads the command line to get the options and the -- input file. You will probably have to replace it, unless you also -- use Verdix Ada system for Unix. WITH Define; Use Define; WITH Command_line; USE Command_line; -- Verdix package WITH Text_io; WITH IO_exceptions; PROCEDURE Read_command_line(Language : OUT Define.Languages; Exact : OUT boolean; Eightbit : OUT boolean) IS BEGIN FOR i IN 1..argc - 1 LOOP IF argv(i).s(1) = '-' THEN CASE argv(i).s(2) IS WHEN '8' => Eightbit := true; WHEN 'E' ! 'e' => Exact := true; WHEN 'L' ! 'l' => CASE argv(i).s(3) IS WHEN 's' ! 'S' => Language := Swedish; WHEN 'd' ! 'D' => Language := Danish; WHEN 'g' => Language := German1; WHEN 'G' => Language := German2; WHEN 'f' ! 'F' => Language := French; WHEN OTHERS => NULL; END CASE; WHEN OTHERS => Text_io.Put_line("Unknown option: " & argv(i).s); END CASE; ELSE DECLARE USE Text_io; Infile : File_type; BEGIN Open(Infile, In_file, argv(i).s); Set_input(Infile); EXCEPTION WHEN IO_exceptions.Name_error => Put_line(argv(i).s & " does not exsist"); END; END IF; END LOOP; END Read_command_line; SHAR_EOF fi if test -f 'define.a' then echo shar: "will not over-write existing file 'define.a'" else cat << \SHAR_EOF > 'define.a' ---------------------------------------------------------------------- -- Define collating sequence -- ---------------------------------------------------------------------- -- This package contains a procedure with the same name that demon- -- strates the use of the load operations in the String_comparison -- package. PACKAGE Define IS TYPE Languages IS (Swedish, Danish, German1, German2, French); -- German1 sort "A, "O and "U as A, O and U. German2 as AE, OE and UE. PROCEDURE Collatting_sequence(Language : IN Languages; Case_significant : IN boolean; Eightbit : IN boolean); END Define; WITH String_comparison; USE String_comparison; WITH ISO_Latin_1; USE ISO_Latin_1; WITH National_ASCII; USE National_ASCII; PACKAGE BODY Define IS PROCEDURE Collatting_sequence(Language : IN Languages; Case_significant : IN boolean; Eightbit : IN boolean) IS BEGIN -- Set the significane of case Set_case_significance(Case_significant); -- Load space as the first letter and the A to Z Load_alphabetic(' '); -- Load the letters from A to Z to begin with FOR ch IN 'A'..'V' LOOP Alpha_both_cases(ch); END LOOP; IF Language = Swedish THEN Variant_both_cases('W', 'V'); ELSE Alpha_both_cases('W'); END IF; FOR ch IN 'X'..'Z' LOOP Alpha_both_cases(ch); END LOOP; -- And so for the specific letters. Begin with the seven-bits IF NOT Eightbit THEN CASE Language IS WHEN Swedish => Alpha_both_cases(SW_UC_A_ring); Alpha_both_cases(SW_UC_A_dots); Alpha_both_cases(SW_UC_O_dots); Variant_both_cases(SW_UC_E_acute, 'E'); Variant_both_cases(SW_UC_U_dots, 'Y'); WHEN Danish => Alpha_both_cases(DA_UC_AE); Alpha_both_cases(DA_UC_O_oblique); Alpha_both_cases(DA_UC_A_ring); WHEN German1 => Variant_both_cases(GER_UC_A_dots, 'A'); Variant_both_cases(GER_UC_O_dots, 'O'); Variant_both_cases(GER_UC_U_dots, 'U'); Load_variant(GER_LC_s_sharp, "ss"); WHEN German2 => Variant_both_cases(GER_UC_A_dots, "AE"); Variant_both_cases(GER_UC_O_dots, "OE"); Variant_both_cases(GER_UC_U_dots, "UE"); Load_variant(GER_LC_s_sharp, "ss"); WHEN French => Load_variant(FR_LC_a_grave, 'a', Accented); Load_variant(FR_LC_c_cedilla, 'c', Accented); Load_variant(FR_LC_e_acute, 'e', Accented); Load_variant(FR_LC_u_grave, 'u', Accented); Load_variant(FR_LC_e_grave, 'e', Accented); END CASE; -- Now lets take the eightbit case, ISO-Latin/1. ELSE -- First we take characters that differs from langauge to language -- They are oA, "A, AE, "O, /O, and "U. CASE Language IS WHEN Swedish => Alpha_both_cases(UC_A_ring); Alpha_both_cases(UC_A_dots); Variant_both_cases(UC_AE, UC_A_dots); Alpha_both_cases(UC_O_dots); Variant_both_cases(UC_O_oblique, UC_O_dots); Variant_both_cases(UC_U_dots, 'Y'); WHEN Danish => Alpha_both_cases(UC_AE); Variant_both_cases(UC_A_dots, UC_AE); Alpha_both_cases(UC_O_oblique); Variant_both_cases(UC_O_dots, UC_O_oblique); Alpha_both_cases(UC_A_ring); Variant_both_cases(UC_U_dots, 'Y'); WHEN German1 ! French => Variant_both_cases(UC_A_dots, 'A'); Variant_both_cases(UC_O_dots, 'O'); Variant_both_cases(UC_U_dots, 'U'); Variant_both_cases(UC_A_ring, 'A'); Variant_both_cases(UC_O_oblique, 'O'); Variant_both_cases(UC_AE, "AE"); WHEN German2 => Variant_both_cases(UC_A_dots, "AE"); Variant_both_cases(UC_O_dots, "OE"); Variant_both_cases(UC_U_dots, "UE"); Variant_both_cases(UC_A_ring, 'A'); Variant_both_cases(UC_O_oblique, 'O'); Variant_both_cases(UC_AE, "AE"); END CASE; -- All other variants Variant_both_cases(UC_A_grave, 'A'); Variant_both_cases(UC_A_acute, 'A'); Variant_both_cases(UC_A_circum, 'A'); Variant_both_cases(UC_A_tilde, 'A'); Variant_both_cases(UC_C_cedilla, 'C'); Variant_both_cases(UC_E_grave, 'E'); Variant_both_cases(UC_E_acute, 'E'); Variant_both_cases(UC_E_circum, 'E'); Variant_both_cases(UC_E_dots, 'E'); Variant_both_cases(UC_Edh, 'D'); Variant_both_cases(UC_I_grave, 'I'); Variant_both_cases(UC_I_acute, 'I'); Variant_both_cases(UC_I_circum, 'I'); Variant_both_cases(UC_I_dots, 'I'); Variant_both_cases(UC_N_tilde, 'N'); Variant_both_cases(UC_O_grave, 'O'); Variant_both_cases(UC_O_acute, 'O'); Variant_both_cases(UC_O_circum, 'O'); Variant_both_cases(UC_O_tilde, 'O'); Load_variant(LC_s_sharp, "ss"); Variant_both_cases(UC_U_grave, 'U'); Variant_both_cases(UC_U_acute, 'U'); Variant_both_cases(UC_U_circum, 'U'); Variant_both_cases(UC_Y_acute, 'Y'); Load_variant(LC_y_dots, 'y', Accented); END IF; -- Finally the numbers FOR ch IN '0'..'9' LOOP Load_alphabetic(ch); END LOOP; END Collatting_sequence; END Define; SHAR_EOF fi if test -f 'latin1.a' then echo shar: "will not over-write existing file 'latin1.a'" else cat << \SHAR_EOF > 'latin1.a' ---------------------------------------------------------------------- -- PACKAGE ISO_Latin_1 -- ---------------------------------------------------------------------- -- This package defines names for the characters in the standard -- ISO 8859/1, known as Latin-1, that are not in the ASCII set, -- i.e. characters with codes >= 160. (Control characters 128-159 -- are excluded. WITH Unchecked_conversion; PACKAGE ISO_Latin_1 IS -- Implementation note: To define the constants within the existing -- character type I use Unchecked_conversion. Note that this is not -- legal Ada. Ada defines the character type as covering codes from -- 0 to 127. Thus, all these declarations should raise Constraint_error, -- however neither DEC Ada, nor Verdix for Unix do so. -- Note also that the Ada definition permits an implementation to -- restrict Unchecked_conversion. -- The proper way would be define a new enumeration type, however this -- requires more work, including a new Text_io. TYPE Byte IS NEW integer RANGE 0..255; FUNCTION Eight_bit IS NEW Unchecked_conversion(Byte, Character); No_break_space : CONSTANT character := Eight_bit(160); Exclaim_up_down : CONSTANT character := Eight_bit(161); Cent : CONSTANT character := Eight_bit(162); Pound : CONSTANT character := Eight_bit(163); Gen_currency : CONSTANT character := Eight_bit(164); Yen : CONSTANT character := Eight_bit(165); Broken_bar : CONSTANT character := Eight_bit(166); Paragraph : CONSTANT character := Eight_bit(167); Diaraesis : CONSTANT character := Eight_bit(168); Copyright : CONSTANT character := Eight_bit(169); Fem_ordinal : CONSTANT character := Eight_bit(170); L_angle_quote : CONSTANT character := Eight_bit(171); Not_sign : CONSTANT character := Eight_bit(172); Soft_hyphen : CONSTANT character := Eight_bit(173); Reg_trade : CONSTANT character := Eight_bit(174); Macron : CONSTANT character := Eight_bit(175); Degree : CONSTANT character := Eight_bit(176); Plus_minus : CONSTANT character := Eight_bit(177); Super_2 : CONSTANT character := Eight_bit(178); Super_3 : CONSTANT character := Eight_bit(179); Acute : CONSTANT character := Eight_bit(180); Mu : CONSTANT character := Eight_bit(181); Pilcrow : CONSTANT character := Eight_bit(182); Middle_dot : CONSTANT character := Eight_bit(183); Cedilla : CONSTANT character := Eight_bit(184); Super_1 : CONSTANT character := Eight_bit(185); Mask_ord : CONSTANT character := Eight_bit(186); R_angle_quote : CONSTANT character := Eight_bit(187); Quarter : CONSTANT character := Eight_bit(188); Half : CONSTANT character := Eight_bit(189); Three_quarter : CONSTANT character := Eight_bit(190); Query_up_down : CONSTANT character := Eight_bit(191); UC_A_grave : CONSTANT character := Eight_bit(192); UC_A_acute : CONSTANT character := Eight_bit(193); UC_A_circum : CONSTANT character := Eight_bit(194); UC_A_tilde : CONSTANT character := Eight_bit(195); UC_A_dots : CONSTANT character := Eight_bit(196); UC_A_ring : CONSTANT character := Eight_bit(197); UC_AE : CONSTANT character := Eight_bit(198); UC_C_cedilla : CONSTANT character := Eight_bit(199); UC_E_grave : CONSTANT character := Eight_bit(200); UC_E_acute : CONSTANT character := Eight_bit(201); UC_E_circum : CONSTANT character := Eight_bit(202); UC_E_dots : CONSTANT character := Eight_bit(203); UC_I_grave : CONSTANT character := Eight_bit(204); UC_I_acute : CONSTANT character := Eight_bit(205); UC_I_circum : CONSTANT character := Eight_bit(206); UC_I_dots : CONSTANT character := Eight_bit(207); UC_edh : CONSTANT character := Eight_bit(208); UC_N_tilde : CONSTANT character := Eight_bit(209); UC_O_grave : CONSTANT character := Eight_bit(210); UC_O_acute : CONSTANT character := Eight_bit(211); UC_O_circum : CONSTANT character := Eight_bit(212); UC_O_tilde : CONSTANT character := Eight_bit(213); UC_O_dots : CONSTANT character := Eight_bit(214); Mult_sign : CONSTANT character := Eight_bit(215); UC_O_oblique : CONSTANT character := Eight_bit(216); UC_U_grave : CONSTANT character := Eight_bit(217); UC_U_acute : CONSTANT character := Eight_bit(218); UC_U_circum : CONSTANT character := Eight_bit(219); UC_U_dots : CONSTANT character := Eight_bit(220); UC_Y_acute : CONSTANT character := Eight_bit(221); UC_thorn : CONSTANT character := Eight_bit(222); LC_s_sharp : CONSTANT character := Eight_bit(223); LC_a_grave : CONSTANT character := Eight_bit(224); LC_a_acute : CONSTANT character := Eight_bit(225); LC_a_circum : CONSTANT character := Eight_bit(226); LC_a_tilde : CONSTANT character := Eight_bit(227); LC_a_dots : CONSTANT character := Eight_bit(228); LC_a_ring : CONSTANT character := Eight_bit(229); LC_ae : CONSTANT character := Eight_bit(230); LC_c_cedilla : CONSTANT character := Eight_bit(231); LC_e_grave : CONSTANT character := Eight_bit(232); LC_e_acute : CONSTANT character := Eight_bit(233); LC_e_circum : CONSTANT character := Eight_bit(234); LC_e_dots : CONSTANT character := Eight_bit(235); LC_i_grave : CONSTANT character := Eight_bit(236); LC_i_acute : CONSTANT character := Eight_bit(237); LC_i_circum : CONSTANT character := Eight_bit(238); LC_i_dots : CONSTANT character := Eight_bit(239); LC_edh : CONSTANT character := Eight_bit(240); LC_n_tilde : CONSTANT character := Eight_bit(241); LC_o_grave : CONSTANT character := Eight_bit(242); LC_o_acute : CONSTANT character := Eight_bit(243); LC_o_circum : CONSTANT character := Eight_bit(244); LC_o_tilde : CONSTANT character := Eight_bit(245); LC_o_dots : CONSTANT character := Eight_bit(246); Div_sign : CONSTANT character := Eight_bit(247); LC_o_oblique : CONSTANT character := Eight_bit(248); LC_u_grave : CONSTANT character := Eight_bit(249); LC_u_acute : CONSTANT character := Eight_bit(250); LC_u_circum : CONSTANT character := Eight_bit(251); LC_u_dots : CONSTANT character := Eight_bit(252); LC_y_acute : CONSTANT character := Eight_bit(253); LC_thorn : CONSTANT character := Eight_bit(254); LC_y_dots : CONSTANT character := Eight_bit(255); END ISO_latin_1; SHAR_EOF fi if test -f 'main.a' then echo shar: "will not over-write existing file 'main.a'" else cat << \SHAR_EOF > 'main.a' ---------------------------------------------------------------------- -- Sort package and main program -- ---------------------------------------------------------------------- -- This file contains a sort package that uses the string-comparison -- package when sorting and the main program. The sort package is very -- simple, it contains just one routine for inserting into the tree -- and for writing the tree to standard output. PACKAGE Sort_package IS PROCEDURE Insert(Str : IN string); PROCEDURE Write_tree; END Sort_package; -- The main program. Reads line from standard input and insert them -- into the sort package. When end-of-fils is detected, write the -- tree. WITH Text_io; WITH IO_exceptions; WITH Sort_package; WITH Define; USE Define; WITH Read_command_line; PROCEDURE Main IS Language : Define.Languages := Swedish; Eightbit : boolean := false; Exact : boolean := false; Line : string(1..80); Len : natural; BEGIN Read_command_line(Language, Exact, Eightbit); Define.collatting_sequence(Language, Exact, Eightbit); LOOP Text_io.Get_line(Line, Len); Sort_package.Insert(Line(1..Len)); END LOOP; EXCEPTION WHEN IO_exceptions.End_error => Sort_package.Write_tree; END Main; -- Below the body of the sort package WITH Text_io; WITH String_comparison; USE String_comparison; PACKAGE BODY Sort_package IS TYPE Tree_entry(Key_size : positive; Str_len : natural); TYPE Tree_type IS ACCESS Tree_entry; TYPE Tree_entry(Key_size : positive; Str_len : natural) IS RECORD Left : Tree_type := NULL; Right : Tree_type := NULL; Key : Transscripted_string(Key_size); Str : string(1..Str_len); END RECORD; Tree : Tree_type := NULL; -- Internal recursive insertion procedure. Called by the exported PROCEDURE Insert(Tree : IN OUT Tree_type; Key : IN Transscripted_string; Str : IN string) IS BEGIN IF Tree /= NULL THEN IF Key < Tree.Key THEN Insert(Tree.left, Key, Str); ELSIF Key > Tree.Key THEN Insert(Tree.right, Key, Str); END IF; ELSE Tree := NEW Tree_entry(Key.Max_length, Str'length); Tree.Key := Key; Tree.Str := Str; END IF; END Insert; -- Exported Insert PROCEDURE Insert(Str : IN string) IS Transscript : Transscripted_string(Str'length + 20); BEGIN Transscribe(Str, Transscript); Insert(Tree, Transscript, Str); EXCEPTION WHEN Transscription_error => Text_io.Put_line(Str); Text_io.Put_line("This line has too long transscription. Skipped."); END Insert; -- This procedure travserse the tree and writes all entries on standard output PROCEDURE Write_tree(Tree : IN Tree_type) IS BEGIN IF Tree /= NULL THEN Write_tree(Tree.Left); Text_io.Put_line(Tree.Str); Write_tree(Tree.Right); END IF; END Write_tree; -- Exported Write_tree; PROCEDURE Write_tree IS BEGIN Write_tree(Tree); END; END Sort_package; SHAR_EOF fi if test -f 'natascii.a' then echo shar: "will not over-write existing file 'natascii.a'" else cat << \SHAR_EOF > 'natascii.a' ---------------------------------------------------------------------- -- PACKAGE National ASCII -- ---------------------------------------------------------------------- -- This package declares alternate names for the ASCII codes -- 64, 91-94, 96 and 123-126 to be used when when these codes refers -- to national characters. The names are restricted to letters. -- Languages covered: Swedish/Finnish, Danish/Norwegian, German, -- French and Italian. PACKAGE National_ASCII IS -- Swedish and Finnish SW_UC_E_acute : CONSTANT character := '@'; SW_UC_A_ring : CONSTANT character := ']'; SW_UC_A_dots : CONSTANT character := '['; SW_UC_O_dots : CONSTANT character := '\'; SW_UC_U_dots : CONSTANT character := '^'; SW_LC_e_acute : CONSTANT character := '`'; SW_LC_a_ring : CONSTANT character := '}'; SW_LC_a_dots : CONSTANT character := '{'; SW_LC_o_dots : CONSTANT character := '|'; SW_LC_u_dots : CONSTANT character := '~'; -- Danish and Norwegian DA_UC_AE : CONSTANT character := '['; DA_UC_O_oblique : CONSTANT character := '\'; DA_UC_A_ring : CONSTANT character := ']'; DA_UC_U_dots : CONSTANT character := '^'; DA_LC_ae : CONSTANT character := '{'; DA_LC_o_oblique : CONSTANT character := '|'; DA_LC_a_ring : CONSTANT character := '}'; DA_LC_u_dots : CONSTANT character := '~'; -- German GER_UC_A_dots : CONSTANT character := '['; GER_UC_O_dots : CONSTANT character := '\'; GER_UC_U_dots : CONSTANT character := ']'; GER_LC_a_dots : CONSTANT character := '{'; GER_LC_o_dots : CONSTANT character := '|'; GER_LC_u_dots : CONSTANT character := '}'; GER_LC_s_sharp : CONSTANT character := '~'; -- French FR_LC_a_grave : CONSTANT character := '@'; FR_LC_c_cedilla : CONSTANT character := '\'; FR_LC_e_acute : CONSTANT character := '{'; FR_LC_u_grave : CONSTANT character := '|'; FR_LC_e_grave : CONSTANT character := '}'; -- Italian IT_LC_A_ring : CONSTANT character := ']'; IT_LC_u_grave : CONSTANT character := '`'; IT_LC_a_grave : CONSTANT character := '}'; IT_LC_o_grave : CONSTANT character := '{'; IT_LC_e_grave : CONSTANT character := '|'; IT_LC_i_grave : CONSTANT character := '~'; END National_ASCII; SHAR_EOF fi if test -f 'strcompb.a' then echo shar: "will not over-write existing file 'strcompb.a'" else cat << \SHAR_EOF > 'strcompb.a' ---------------------------------------------------------------------- -- BODY string_comparison -- ---------------------------------------------------------------------- -- This file contains the implementation part of the string comparison -- package. PACKAGE BODY string_comparison IS -- CONTENTS -- -------- -- Type declarations and simple functions -- Internal Load operations -- Exported load operations -- Internal routines for transscribing numbers -- Exported transscription operations -- Internal comparison procedures -- Exportered string comparators -- The transscription table -- The translation of a character is a string. This is for characters -- like the AE ligature. Also useful is you want "0" = "zero". TYPE Transscript_entry(Length : positive) IS RECORD Alphabetic : Natural_string(1..Length) := (OTHERS => 0); Accent : Natural_string(1..Length) := (OTHERS => 0); Case_variant : boolean := false; END RECORD; TYPE Entry_ptr IS ACCESS Transscript_entry; -- Pointer to allow different sizes -- The index in the table is the ordinal number. Ada's character type is -- limited to 127. Char_table : ARRAY (0..255) OF Entry_ptr := (OTHERS => NULL); -- Other types -- This type is for internal comparison functions TYPE Relation_type IS (Less_than, Equal, Greater_than); -- Range for the number characters SUBTYPE Numbers IS integer RANGE character'pos('0')..character'pos('9'); -- Variables -- Case significance Case_significant : boolean := true; -- Last used codes Last_alpha_code : integer := 0; Last_accent_code : integer := 0; -- When storing an alphabetic we increment Last_alpha_code, when loading -- a accent variant we increment Last_accent_code. -- Simple functions FUNCTION Is_letter(ch : character) RETURN boolean IS BEGIN RETURN Char_table(character'pos(ch)).Length > 0; END; -- Set case significance for the double-case load operations PROCEDURE Set_case_significance(Flag : boolean) IS BEGIN Case_significant := Flag; END; -- Internal Load operations -- These take integer parametes. The exported routines call these. -- We're having integer to avoid problems with characters over 127. PROCEDURE Load_alphabetic(ch : integer) IS -- Load ch in the table as a one without any Accent part. If ch is already -- defined, raise Already defined. BEGIN IF Char_table(ch) /= NULL THEN RAISE Already_defined; END IF; Char_table(ch) := NEW Transscript_entry(1); Last_alpha_code := Last_alpha_code + 1; Char_table(ch).Alphabetic(1) := Last_alpha_code; END Load_alphabetic; PROCEDURE Load_variant(ch : IN integer; Equ_ch : IN integer; Equ_kind : IN Equivalence_kind) IS -- Load ch as an variant of Equ_ch. Equ_ch must be defined or else -- we raise Undefined_equivalent. BEGIN IF Char_table(ch) /= NULL THEN RAISE Already_defined; END IF; IF Char_table(Equ_ch) = NULL THEN RAISE Undefined_equivalent; END IF; Char_table(ch) := NEW Transscript_entry(Char_table(Equ_ch).Length); Char_table(ch).Alphabetic := Char_table(Equ_ch).Alphabetic; Char_table(ch).Accent := Char_table(Equ_ch).Accent; Char_table(ch).Case_variant := Char_table(Equ_ch).Case_variant; -- Actually: Char_table(ch).all := Char_table(Equ_ch).all; -- Alas, Verdix Ada can't handle this properly CASE Equ_kind IS WHEN Exact => NULL; WHEN Case_diff => Char_table(ch).Case_variant := true; WHEN Accented => Last_accent_code := Last_accent_code + 1; Char_table(ch).Accent(1) := Last_accent_code; END CASE; END Load_variant; PROCEDURE Load_variant(ch : IN integer; Equ_str : IN Natural_string) IS -- Load ch as an accented letter (digraph) of Equ_str. If not all -- characters in Equ_str are deifined, raise Undefined_equivalent. BEGIN IF Char_table(ch) /= NULL THEN RAISE Already_defined; END IF; FOR i IN Equ_str'range LOOP IF Char_table(Equ_str(i)) = NULL THEN RAISE Undefined_equivalent; END IF; END LOOP; Char_table(ch) := NEW Transscript_entry(Equ_str'length); FOR i IN Equ_str'range LOOP Char_table(ch).Alphabetic(i) := Char_table(Equ_str(i)).Alphabetic(1); Last_accent_code := Last_accent_code + 1; Char_table(ch).Accent(i) := Last_accent_code; END LOOP; END Load_variant; -- The exported load operations PROCEDURE Load_alphabetic(ch : IN character) IS BEGIN Load_alphabetic(character'pos(ch)); END Load_alphabetic; PROCEDURE Load_variant(ch : IN character; Equ_ch : IN character; Equ_kind : IN Equivalence_kind) IS BEGIN Load_variant(character'pos(ch), character'pos(Equ_ch), Equ_kind); END Load_variant; PROCEDURE Load_variant(ch : IN character; Equ_str : IN string) IS Equ_int : Natural_string(Equ_str'range); BEGIN FOR i IN Equ_str'range LOOP Equ_int(i) := character'pos(Equ_str(i)); END LOOP; Load_variant(character'pos(ch), Equ_int); END Load_variant; -- Exported double-case load operations. PROCEDURE Alpha_both_cases(ch : IN character) IS Int_ch : integer := character'pos(ch); BEGIN Load_alphabetic(Int_ch); IF Case_significant THEN Load_variant(Int_ch + 32, Int_ch, Case_diff); ELSE Load_variant(Int_ch + 32, Int_ch, Exact); END IF; END Alpha_both_cases; PROCEDURE Variant_both_cases(ch : IN character; Equ_ch : IN character) IS Int_ch : integer := character'pos(ch); BEGIN Load_variant(Int_ch, character'pos(Equ_ch), Accented); IF Case_significant THEN Load_variant(Int_ch + 32, Int_ch, Case_diff); ELSE Load_variant(Int_ch + 32, Int_ch, Exact); END IF; END Variant_both_cases; PROCEDURE Variant_both_cases(ch : IN character; Equ_str : IN string) IS Int_ch : integer := character'pos(ch); BEGIN Load_variant(ch, Equ_str); IF Case_significant THEN Load_variant(Int_ch + 32, Int_ch, Case_diff); ELSE Load_variant(Int_ch + 32, Int_ch, Exact); END IF; END Variant_both_cases; -- Internal procedure for transscribing numbers PROCEDURE Get_number(Str : IN string; Str_ix : IN OUT integer; Number : OUT integer) IS -- Assume Str(Str_ix) is a number. Read as long there are numbers. -- Leave Str_ix at the last number character. No_in_str : natural := 0; ch : integer := character'pos(Str(Str_ix)); BEGIN WHILE ch IN Numbers LOOP No_in_str := 10 * No_in_str + ch - Numbers'first; IF Str_ix + 1 IN Str'range THEN Str_ix := Str_ix + 1; ch := character'pos(Str(Str_ix)); ELSE ch := 0; END IF; END LOOP; Number := No_in_str; EXCEPTION WHEN Numeric_error => RAISE Transscription_error; END; -- Exported transscription operations PROCEDURE Transscribe(ch : IN character; Trans_str : OUT Transscripted_string) IS BEGIN Transscribe( (1 => ch), Trans_str); END Transscribe; PROCEDURE Transscribe(Str : IN string; Trans_str : OUT Transscripted_string) IS -- Transscribe Str using the table. If the transscription does -- not fit into the out parameter, raise Transscription_error. -- Characters in Str that are not defined are regarded as non-letters. -- Non-letters are always stored at the their index in Str. -- Numbers are stored specially. ch : natural; -- Current character; Tr_ix : natural := 0; -- Index in Trans_str except the non-letter part. Str_ix : integer := Str'first; -- Index in Str and non-letter part. No_in_str : natural; BEGIN WHILE Str_ix IN Str'range LOOP ch := character'pos(Str(Str_ix)); IF Char_table(ch) /= NULL THEN IF Tr_ix + Char_table(ch).Length > Trans_str.Max_length THEN RAISE Transscription_error; END IF; IF ch NOT IN Numbers OR Char_table(ch).Accent(1) /= 0 THEN FOR i IN 1..Char_table(ch).Length LOOP Tr_ix := Tr_ix + 1; Trans_str.Alphabetic(Tr_ix) := Char_table(ch).Alphabetic(i); Trans_str.Case_part(Tr_ix) := Char_table(ch).Case_variant; Trans_str.Accents(Tr_ix) := Char_table(ch).Accent(i); END LOOP; ELSE Get_number(Str, Str_ix, No_in_str); Tr_ix := Tr_ix + 1; Trans_str.Alphabetic(Tr_ix) := 1000 + No_in_str; END IF; ELSE IF Str_ix > Trans_str.Max_length THEN RAISE Transscription_error; END IF; Trans_str.Non_letters(Str_ix) := ch; Trans_str.Non_letter_length := Str_ix; END IF; Str_ix := Str_ix + 1; END LOOP; Trans_str.Length := Tr_ix; END Transscribe; -- Internal comparison routines FUNCTION Relation(Left, Right : Natural_string) RETURN Relation_type IS -- This function is more os less obsolete. "<" etc should do the job. -- Verdix Ada can't this on integer arrays, unfortunately. i : positive := 1; Bug : EXCEPTION; -- Should not occur BEGIN WHILE (i <= Left'last AND i <= Right'last) AND THEN Left(i) = Right(i) LOOP i := i + 1; END LOOP; IF i > Left'last AND i > Right'last THEN RETURN Equal; ELSIF i > Left'last THEN RETURN Less_than; ELSIF i > Right'last THEN RETURN Greater_than; ELSIF Left(i) < Right(i) THEN RETURN Less_than; ELSIF Left(i) > Right(i) THEN RETURN Greater_than; ELSE RAISE Bug; -- This should not occur. END IF; END Relation; FUNCTION Relation(Left, Right : Transscripted_string) RETURN Relation_type IS -- Compare the parts in order. Continue as long as there is unequallity. Rel : Relation_type; BEGIN Rel := Relation(Left.Alphabetic(1..Left.Length), Right.Alphabetic(1..Right.Length)); IF Rel /= Equal THEN RETURN Rel; END IF; Rel := Relation(Left.Accents(1..Left.Length), Right.Accents(1..Right.Length)); IF Rel /= Equal THEN RETURN Rel; END IF; Rel := Relation(Left.Non_letters(1..Left.Non_letter_length), Right.Non_letters(1..Right.Non_letter_length)); IF Rel /= Equal THEN RETURN Rel; END IF; IF Left.Case_part(1..Left.Length) < Right.Case_part(1..Right.Length) THEN RETURN Less_than; ELSIF Left.Case_part(1..Left.Length) > Right.Case_part(1..Right.Length) THEN RETURN Greater_than; ELSE RETURN Equal; END IF; END Relation; -- Exported comparison operators FUNCTION "<=" (Left, Right : Transscripted_string) RETURN boolean IS BEGIN RETURN Relation(Left, Right) /= Greater_than; END; FUNCTION "<" (Left, Right : Transscripted_string) RETURN boolean IS BEGIN RETURN Relation(Left, Right) = Less_than; END; FUNCTION ">=" (Left, Right : Transscripted_string) RETURN boolean IS BEGIN RETURN Relation(Left, Right) /= Less_than; END; FUNCTION ">" (Left, Right : Transscripted_string) RETURN boolean IS BEGIN RETURN Relation(Left, Right) = Greater_than; END; END string_comparison; SHAR_EOF fi if test -f 'strcomps.a' then echo shar: "will not over-write existing file 'strcomps.a'" else cat << \SHAR_EOF > 'strcomps.a' ---------------------------------------------------------------------- -- SPECIFCATION String_comparison -- ---------------------------------------------------------------------- -- This package provides operations for comparing strings according to -- a user-defined scheme. -- The package contains operations for load an internal coding table, -- routines for coding strings and for comparing coded strings. PACKAGE String_comparison IS -- Load a character as the next in the primary colltating sequence PROCEDURE Load_alphabetic(ch : IN character); PROCEDURE Alpha_both_cases(ch : IN character); -- Load a variant of a character in the main sequence, on accent -- level, on case level or as exactly the same. TYPE Equivalence_kind IS (Exact, Case_diff, Accented); PROCEDURE Load_variant(ch : IN character; Equ_ch : IN character; Equ_kind : IN Equivalence_kind); -- The three below always load on accent level. PROCEDURE Load_variant(ch : IN character; Equ_str : IN string); PROCEDURE Variant_both_cases(ch : IN character; Equ_ch : IN character); PROCEDURE Variant_both_cases(ch : IN character; Equ_str : IN string); -- Exceptions that can be raised by the load operations Undefined_equivalent : EXCEPTION; Already_defined : EXCEPTION; -- Change case significance when loading both cases. Default is off. PROCEDURE Set_case_significance(Flag : boolean); -- Transscript type and coding operations TYPE Transscripted_string(Max_length : natural) IS PRIVATE; PROCEDURE Transscribe(ch : IN character; Trans_str : OUT Transscripted_string); PROCEDURE Transscribe(Str : IN string; Trans_str : OUT Transscripted_string); Transscription_error : EXCEPTION; -- Comparison operators FUNCTION "<=" (Left, Right : Transscripted_string) RETURN boolean; FUNCTION "<" (Left, Right : Transscripted_string) RETURN boolean; FUNCTION ">=" (Left, Right : Transscripted_string) RETURN boolean; FUNCTION ">" (Left, Right : Transscripted_string) RETURN boolean; -- Others FUNCTION Is_letter(ch : character) RETURN boolean; PRIVATE TYPE Natural_string IS ARRAY(integer RANGE <>) OF natural; TYPE Boolean_string IS ARRAY(integer RANGE <>) OF boolean; TYPE Transscripted_string(Max_length : natural) IS RECORD Length : natural := 0; Alphabetic : Natural_string(1..Max_length) := (OTHERS => 0); Accents : Natural_string(1..Max_length) := (OTHERS => 0); Case_part : Boolean_string(1..Max_length) := (OTHERS => false); Non_letter_length : natural := 0; Non_letters : Natural_string(1..Max_length) := (OTHERS => 256); END RECORD; END String_comparison; SHAR_EOF fi exit 0 # End of shell archive