[comp.sources.misc] A customable string-comparison package

allbery@ncoast.UUCP (11/02/87)

This is to be posted in comp.sources.misc. Thank you.

This posting contains a package for string-comparisons
in fairly sophisticated way where reagrd to accents, 
non-letters and case is only taken if necessary. 
  The user defines how each character should be sorted,
whether if it is a letter, and whether it is a variant
of another letter. See READ ME for complete description.
  The purpose is not to provide a facility, though, 
rather to demonstrate the idea. The code is in Ada,
so you may have problem with using it straight off,
if you don't have access to a compiler. Still you can 
take advantage of the ideas if you like. Comments and 
questions are welcome to:

Erland Sommarskog
ENEA Data, Stockholm
sommar@enea

----------------------------------------------------------------------
#! /bin/sh
# This is a shell archive, meaning:
# 1. Remove everything above the #! /bin/sh line.
# 2. Save the resulting text in a file.
# 3. Execute the file with /bin/sh (not csh) to create:
#	READ ME
#	comline.a
#	define.a
#	latin1.a
#	main.a
#	natascii.a
#	strcompb.a
#	strcomps.a
# This archive created: Fri Oct 30 23:01:51 1987
export PATH; PATH=/bin:/usr/bin:$PATH
if test -f 'READ ME'
then
	echo shar: "will not over-write existing file 'READ ME'"
else
cat << \SHAR_EOF > 'READ ME'
The intention of this posting is not to provide a facility, but 
rather to demonstrate a technique to do string comparisons 
in a more sophisticated way than simply using ASCII values.

Comments, questions etc are very welcome to:
Erland Sommarskog       
ENEA Data, Stockholm    
sommar@enea.UUCP        

The posting contains seven files that can be divided into three
groups:
I:   strcompS.a and strcompB.a
     The core of the posting. They contain a package for string 
     comparisons. It has a character-transscription table to be
     loaded by the user and comparison operators for trans-
     scripted string. The exported routines are described below. 
     StrcompS is the specification, whereas strcompB contains
     the package body.
II:  latin1.a and natascii.a
     They declare names for characters, to be used, for example,
     when defining a collating sequence for the package above.
     Latin1 declares names for the ISO standard 8859/1. Natascii
     declares names for national replacements of the ordinary 
     ASCII set.
III: define.a, comline.a and main.a
     An demonstration application that uses the string-comparison
     package. Define.a loads the character collating sequence.
       Comline.a reads the command line. Note that this file is
     bound to Verdix Ada for Unix and must be rewritten for another
     system.
       Main.a is the main program. It reads lines from standard 
     input or a named file and writes the sorted lines to standard
     output when end-of-file is detected. 
       You find a description of the options last in this file.
       
You should compile the files in the order: latin1, natascii,
strcompS, strcompB, define, comline, main.

Four-dimensional sorting
------------------------
       
The string-comparison package compares strings at four levels:
1) Alphabetic
2) Accents
3) Non-letters
4) Difference in case 
What is an alphabetic etc is up to the user. He may define "$" 
being a letter with "(" as its lowercase variant if he likes. 

One level is only regarded if the level above have no difference.
As an example I take 
      T^ete-`a-t^ete
(I assume a "normal" loading of the character table here.)
  For the first level we use TETEATETE, thus we remove the accents
and the hyphens. On the next we re-insert the accents so we get
      T^ETE`AT^ETE
On level three we only take the hyphens in regard. When comparing
non-letters the package uses the simple ASCII values. The earlier
a character comes, the lower is the sort value. Thus, "trans-scription"
will precede "transscrip-tion". (Actually, as the implementation 
is done, the position is more important than the ASCII value.)
  On the last level we use 
    T^ete`at^ete
thus, the original writing with the hyphens removed. Note that the
user can specify case to be insigificant.
  (This isn't a description on how the package is implemented, just 
a way of illustrating the result. In practice it's done a little
more effective.)

When defining accented variants it is possible to let a character
be a variant of a string, in this way the AE ligature can be sorted
as "AE". The opposite is not possible, and what worse is, a string
can't have an alphabetic value. Thus the package is not able to sort
languages as Spanish (CH and LL) correctly.

The number characters are handled in a special way if you define them 
as alphabetics. A sequence of figures will read as one number and sort 
after all other alphabetics. (Even if they were defined as the first 
characters.) So you will get
   File1   File2   File10   File11
instead of the usual
   File1   File10  File11   File2
  If you like to sort them as they are read, this is also possible.
E.g. load "0" as a variant of "zero".

The package contains the following routines:

Load Operations
---------------
PROCEDURE Load_alphabetic(ch : IN character);
Loads ch as the next alphabetic character. The order of loading
determines the sorting values.

PROCEDURE Load_variant(ch       : IN character;  
                       Equ_ch   : IN character;
                       Equ_kind : IN Equivalence_kind);
TYPE Equivalence_kind IS (Exact, Case_diff, Accented);   
PROCEDURE Load_variant(ch      : IN character;  
                       Equ_str : IN string);  
Load_variant loads ch as a variant of Equ_ch or Equ_str. The interpretation
of Equ_kind is:
Exact: Exactly the same. There is no difference. What you use when you
       don't want case to be significant.
Case_diff: Load ch as a lowercase variant of Equ_ch. There will be
           difference at level 4.
Accented:  Load ch as variant of Equ_ch at level 2.
The latter version of Load_variant always loads ch at level 2.

For simplify loading, the package also provides routines for loading
a character and its ASCII lowercase equivalent simultaneously:
PROCEDURE Set_case_significance(Flag : boolean);
PROCEDURE Alpha_both_cases(ch : IN character);  
PROCEDURE Variant_both_cases(ch     : IN character;
                             Equ_ch : IN character);
PROCEDURE Variant_both_cases(ch      : IN character;       
                             Equ_str : IN string);
With Set_case_significant you determine whether case should be
significant when loading the pairs. Variant_both_cases loads ch
at level 2.

The loading operations raise Already_defined if an attempt is
made to load a character twice. If Equ_ch or part of Equ_str is
undefined, this gives the exception Undefined_equivalent.

Transscription operations
-------------------------
These routines translates a string to the internal coding. 
TYPE Transscripted_string(Max_length : natural) IS PRIVATE;
PROCEDURE Transscribe(ch        : IN character;
                      Trans_str : OUT Transscripted_string);
PROCEDURE Transscribe(Str       : IN string;
                      Trans_str : OUT Transscripted_string);
If the transscription is too long, the routines will raise
Transscription_error.
                      
Comparison operators:
---------------------
FUNCTION "<=" (Left, Right : Transscripted_string) RETURN boolean;
FUNCTION "<"  (Left, Right : Transscripted_string) RETURN boolean;
FUNCTION ">=" (Left, Right : Transscripted_string) RETURN boolean;
FUNCTION ">"  (Left, Right : Transscripted_string) RETURN boolean;

I have only included operations for comparing transscripted 
strings. Of course there could be a set for uncoded strings too.

Other function
--------------
FUNCTION Is_letter(ch : character) RETURN boolean;

The demonstration program
-------------------------
The program takes the options:
-8  Use ISO/Latin-1. If not present, use 7-bit ASCII with national
    replacements.
-e  Case is significant. When omitted, case is not significant.
-LX Selects language. X should be one of the following:
    s or S: Swedish. (Default)
    d or D: Danish
    g:      German1: "A, "O and "U sorts as A, O and U.
    G:      German2: "A, "O and "U sorts as AE, OE and UE.
    f or F  French
   
In the definition routine I load space as the first alphabetic
letter. This gives the result that "Smith, Tony" will sort
before "Smithson, Alan".
SHAR_EOF
fi
if test -f 'comline.a'
then
	echo shar: "will not over-write existing file 'comline.a'"
else
cat << \SHAR_EOF > 'comline.a'
----------------------------------------------------------------------
--                 PROCEDURE Read_command_line                      --
----------------------------------------------------------------------
-- This procedure reads the command line to get the options and the
-- input file. You will probably have to replace it, unless you also
-- use Verdix Ada system for Unix.
WITH Define; Use Define;
WITH Command_line; USE Command_line;       -- Verdix package
WITH Text_io;
WITH IO_exceptions;
PROCEDURE Read_command_line(Language : OUT Define.Languages;
                            Exact    : OUT boolean;
                            Eightbit : OUT boolean) IS 
BEGIN
   FOR i IN 1..argc - 1 LOOP
      IF argv(i).s(1) = '-' THEN
         CASE argv(i).s(2) IS 
            WHEN '8'       => Eightbit := true;
            WHEN 'E' ! 'e' => Exact    := true;
            WHEN 'L' ! 'l' => CASE argv(i).s(3) IS
                                 WHEN 's' ! 'S' => Language := Swedish;
                                 WHEN 'd' ! 'D' => Language := Danish;
                                 WHEN 'g'       => Language := German1;
                                 WHEN 'G'       => Language := German2;
                                 WHEN 'f' ! 'F' => Language := French;
                                 WHEN OTHERS    => NULL; 
                              END CASE;                      
            WHEN OTHERS    => Text_io.Put_line("Unknown option: " & argv(i).s);
         END CASE;
      ELSE
         DECLARE
            USE Text_io;
            Infile : File_type;
         BEGIN
            Open(Infile, In_file, argv(i).s);
            Set_input(Infile);
         EXCEPTION
            WHEN IO_exceptions.Name_error => 
                 Put_line(argv(i).s & " does not exsist");
         END;
      END IF;
   END LOOP;
END Read_command_line;
SHAR_EOF
fi
if test -f 'define.a'
then
	echo shar: "will not over-write existing file 'define.a'"
else
cat << \SHAR_EOF > 'define.a'
----------------------------------------------------------------------
--                  Define collating sequence                       --
----------------------------------------------------------------------
-- This package contains a procedure with the same name that demon-
-- strates the use of the load operations in the String_comparison 
-- package. 


PACKAGE Define IS
   TYPE Languages IS (Swedish, Danish, German1, German2, French);
   -- German1 sort "A, "O and "U as A, O  and U. German2 as AE, OE and UE.
   PROCEDURE Collatting_sequence(Language         : IN Languages;
                                 Case_significant : IN boolean;
                                 Eightbit         : IN boolean);
END Define;


WITH String_comparison; USE String_comparison;
WITH ISO_Latin_1;       USE ISO_Latin_1;
WITH National_ASCII;    USE National_ASCII;
PACKAGE BODY Define IS
   PROCEDURE Collatting_sequence(Language         : IN Languages;
                                 Case_significant : IN boolean;
                                 Eightbit         : IN boolean) IS
   BEGIN
      -- Set the significane of case
      Set_case_significance(Case_significant);
   
      -- Load space as the first letter and the A to Z
      Load_alphabetic(' ');
   
      -- Load the letters from A to Z to begin with
      FOR ch IN 'A'..'V' LOOP
         Alpha_both_cases(ch);
      END LOOP;
      IF Language = Swedish THEN
         Variant_both_cases('W', 'V');  
      ELSE
         Alpha_both_cases('W');
      END IF;
      FOR ch IN 'X'..'Z' LOOP
         Alpha_both_cases(ch);
      END LOOP;          
   
      -- And so for the specific letters. Begin with the seven-bits
      IF NOT Eightbit THEN
         CASE Language IS
            WHEN Swedish =>  Alpha_both_cases(SW_UC_A_ring);
                             Alpha_both_cases(SW_UC_A_dots);
                             Alpha_both_cases(SW_UC_O_dots);
                             Variant_both_cases(SW_UC_E_acute, 'E');
                             Variant_both_cases(SW_UC_U_dots, 'Y');
            WHEN Danish  =>  Alpha_both_cases(DA_UC_AE);
                             Alpha_both_cases(DA_UC_O_oblique); 
                             Alpha_both_cases(DA_UC_A_ring);
            WHEN German1 =>  Variant_both_cases(GER_UC_A_dots, 'A');
                             Variant_both_cases(GER_UC_O_dots, 'O');
                             Variant_both_cases(GER_UC_U_dots, 'U');
                             Load_variant(GER_LC_s_sharp, "ss");
            WHEN German2 =>  Variant_both_cases(GER_UC_A_dots, "AE");
                             Variant_both_cases(GER_UC_O_dots, "OE");
                             Variant_both_cases(GER_UC_U_dots, "UE");
                             Load_variant(GER_LC_s_sharp, "ss");
            WHEN French  =>  Load_variant(FR_LC_a_grave, 'a', Accented);
                             Load_variant(FR_LC_c_cedilla, 'c', Accented);
                             Load_variant(FR_LC_e_acute, 'e', Accented);
                             Load_variant(FR_LC_u_grave, 'u', Accented);
                             Load_variant(FR_LC_e_grave, 'e', Accented);
         END CASE;
-- Now lets take the eightbit case, ISO-Latin/1.
      ELSE                                          
         -- First we take characters that differs from langauge to language
         -- They are oA, "A, AE, "O, /O, and "U.
         CASE Language IS
            WHEN Swedish  => Alpha_both_cases(UC_A_ring);
                             Alpha_both_cases(UC_A_dots);
                             Variant_both_cases(UC_AE, UC_A_dots); 
                             Alpha_both_cases(UC_O_dots);        
                             Variant_both_cases(UC_O_oblique, UC_O_dots);
                             Variant_both_cases(UC_U_dots, 'Y');
            WHEN Danish   => Alpha_both_cases(UC_AE);
                             Variant_both_cases(UC_A_dots, UC_AE);
                             Alpha_both_cases(UC_O_oblique);
                             Variant_both_cases(UC_O_dots, UC_O_oblique);
                             Alpha_both_cases(UC_A_ring);
                             Variant_both_cases(UC_U_dots, 'Y');       
            WHEN German1 !
                 French   => Variant_both_cases(UC_A_dots, 'A');
                             Variant_both_cases(UC_O_dots, 'O');   
                             Variant_both_cases(UC_U_dots, 'U');
                             Variant_both_cases(UC_A_ring, 'A');
                             Variant_both_cases(UC_O_oblique, 'O');
                             Variant_both_cases(UC_AE, "AE"); 
            WHEN German2  => Variant_both_cases(UC_A_dots, "AE");
                             Variant_both_cases(UC_O_dots, "OE");   
                             Variant_both_cases(UC_U_dots, "UE");
                             Variant_both_cases(UC_A_ring, 'A');
                             Variant_both_cases(UC_O_oblique, 'O');
                             Variant_both_cases(UC_AE, "AE"); 
         END CASE;
       
         -- All other variants 
         Variant_both_cases(UC_A_grave, 'A');
         Variant_both_cases(UC_A_acute, 'A');
         Variant_both_cases(UC_A_circum, 'A');
         Variant_both_cases(UC_A_tilde, 'A');
         
         Variant_both_cases(UC_C_cedilla, 'C');
         
         Variant_both_cases(UC_E_grave, 'E');
         Variant_both_cases(UC_E_acute, 'E');
         Variant_both_cases(UC_E_circum, 'E');
         Variant_both_cases(UC_E_dots, 'E');
                                   
         Variant_both_cases(UC_Edh, 'D');
         
         Variant_both_cases(UC_I_grave, 'I');
         Variant_both_cases(UC_I_acute, 'I');
         Variant_both_cases(UC_I_circum, 'I');
         Variant_both_cases(UC_I_dots, 'I');
         
         Variant_both_cases(UC_N_tilde, 'N');
         
         Variant_both_cases(UC_O_grave, 'O');
         Variant_both_cases(UC_O_acute, 'O');
         Variant_both_cases(UC_O_circum, 'O');
         Variant_both_cases(UC_O_tilde, 'O');
                        
         Load_variant(LC_s_sharp, "ss");
         
         Variant_both_cases(UC_U_grave, 'U');
         Variant_both_cases(UC_U_acute, 'U');
         Variant_both_cases(UC_U_circum, 'U');
             
         Variant_both_cases(UC_Y_acute, 'Y');
         Load_variant(LC_y_dots, 'y', Accented);
      END IF;
      
      -- Finally the numbers
      FOR ch IN '0'..'9' LOOP
         Load_alphabetic(ch);
      END LOOP;
   END Collatting_sequence;
END Define;
SHAR_EOF
fi
if test -f 'latin1.a'
then
	echo shar: "will not over-write existing file 'latin1.a'"
else
cat << \SHAR_EOF > 'latin1.a'
----------------------------------------------------------------------
--                     PACKAGE ISO_Latin_1                          --
----------------------------------------------------------------------
-- This package defines names for the characters in the standard
-- ISO 8859/1, known as Latin-1, that are not in the ASCII set, 
-- i.e. characters with codes >= 160. (Control characters 128-159
-- are excluded.

WITH Unchecked_conversion;
PACKAGE ISO_Latin_1 IS
   -- Implementation note: To define the constants within the existing 
   -- character type I use Unchecked_conversion. Note that this is not 
   -- legal Ada. Ada defines the character type as covering codes from 
   -- 0 to 127. Thus, all these declarations should raise Constraint_error, 
   -- however neither DEC Ada, nor Verdix for Unix do so. 
   --   Note also that the Ada definition permits an implementation to 
   -- restrict Unchecked_conversion.
   -- The proper way would be define a new enumeration type, however this
   -- requires more work, including a new Text_io.
                          
   TYPE Byte IS NEW integer RANGE 0..255;
   FUNCTION Eight_bit IS NEW Unchecked_conversion(Byte, Character);
    
   No_break_space  : CONSTANT character := Eight_bit(160);
   Exclaim_up_down : CONSTANT character := Eight_bit(161);
   Cent            : CONSTANT character := Eight_bit(162);
   Pound           : CONSTANT character := Eight_bit(163);
   Gen_currency    : CONSTANT character := Eight_bit(164);
   Yen             : CONSTANT character := Eight_bit(165);
   Broken_bar      : CONSTANT character := Eight_bit(166);
   Paragraph       : CONSTANT character := Eight_bit(167);
   Diaraesis       : CONSTANT character := Eight_bit(168);
   Copyright       : CONSTANT character := Eight_bit(169);
   Fem_ordinal     : CONSTANT character := Eight_bit(170);
   L_angle_quote   : CONSTANT character := Eight_bit(171);
   Not_sign        : CONSTANT character := Eight_bit(172);
   Soft_hyphen     : CONSTANT character := Eight_bit(173);
   Reg_trade       : CONSTANT character := Eight_bit(174);
   Macron          : CONSTANT character := Eight_bit(175);
   Degree          : CONSTANT character := Eight_bit(176);
   Plus_minus      : CONSTANT character := Eight_bit(177);
   Super_2         : CONSTANT character := Eight_bit(178);
   Super_3         : CONSTANT character := Eight_bit(179);
   Acute           : CONSTANT character := Eight_bit(180);
   Mu              : CONSTANT character := Eight_bit(181);
   Pilcrow         : CONSTANT character := Eight_bit(182);
   Middle_dot      : CONSTANT character := Eight_bit(183);
   Cedilla         : CONSTANT character := Eight_bit(184);
   Super_1         : CONSTANT character := Eight_bit(185);
   Mask_ord        : CONSTANT character := Eight_bit(186);
   R_angle_quote   : CONSTANT character := Eight_bit(187);
   Quarter         : CONSTANT character := Eight_bit(188);
   Half            : CONSTANT character := Eight_bit(189);
   Three_quarter   : CONSTANT character := Eight_bit(190);
   Query_up_down   : CONSTANT character := Eight_bit(191);
   UC_A_grave      : CONSTANT character := Eight_bit(192);
   UC_A_acute      : CONSTANT character := Eight_bit(193);
   UC_A_circum     : CONSTANT character := Eight_bit(194);
   UC_A_tilde      : CONSTANT character := Eight_bit(195);
   UC_A_dots       : CONSTANT character := Eight_bit(196);
   UC_A_ring       : CONSTANT character := Eight_bit(197);
   UC_AE           : CONSTANT character := Eight_bit(198);
   UC_C_cedilla    : CONSTANT character := Eight_bit(199);
   UC_E_grave      : CONSTANT character := Eight_bit(200);
   UC_E_acute      : CONSTANT character := Eight_bit(201);
   UC_E_circum     : CONSTANT character := Eight_bit(202);
   UC_E_dots       : CONSTANT character := Eight_bit(203);
   UC_I_grave      : CONSTANT character := Eight_bit(204);
   UC_I_acute      : CONSTANT character := Eight_bit(205);
   UC_I_circum     : CONSTANT character := Eight_bit(206);
   UC_I_dots       : CONSTANT character := Eight_bit(207);
   UC_edh          : CONSTANT character := Eight_bit(208);
   UC_N_tilde      : CONSTANT character := Eight_bit(209);
   UC_O_grave      : CONSTANT character := Eight_bit(210);
   UC_O_acute      : CONSTANT character := Eight_bit(211);
   UC_O_circum     : CONSTANT character := Eight_bit(212);
   UC_O_tilde      : CONSTANT character := Eight_bit(213);
   UC_O_dots       : CONSTANT character := Eight_bit(214);
   Mult_sign       : CONSTANT character := Eight_bit(215);
   UC_O_oblique    : CONSTANT character := Eight_bit(216);
   UC_U_grave      : CONSTANT character := Eight_bit(217);
   UC_U_acute      : CONSTANT character := Eight_bit(218);
   UC_U_circum     : CONSTANT character := Eight_bit(219);
   UC_U_dots       : CONSTANT character := Eight_bit(220);
   UC_Y_acute      : CONSTANT character := Eight_bit(221);
   UC_thorn        : CONSTANT character := Eight_bit(222);
   LC_s_sharp      : CONSTANT character := Eight_bit(223);
   LC_a_grave      : CONSTANT character := Eight_bit(224);
   LC_a_acute      : CONSTANT character := Eight_bit(225);
   LC_a_circum     : CONSTANT character := Eight_bit(226);
   LC_a_tilde      : CONSTANT character := Eight_bit(227);
   LC_a_dots       : CONSTANT character := Eight_bit(228);
   LC_a_ring       : CONSTANT character := Eight_bit(229);
   LC_ae           : CONSTANT character := Eight_bit(230);
   LC_c_cedilla    : CONSTANT character := Eight_bit(231);
   LC_e_grave      : CONSTANT character := Eight_bit(232);
   LC_e_acute      : CONSTANT character := Eight_bit(233);
   LC_e_circum     : CONSTANT character := Eight_bit(234);
   LC_e_dots       : CONSTANT character := Eight_bit(235);
   LC_i_grave      : CONSTANT character := Eight_bit(236);
   LC_i_acute      : CONSTANT character := Eight_bit(237);
   LC_i_circum     : CONSTANT character := Eight_bit(238);
   LC_i_dots       : CONSTANT character := Eight_bit(239);
   LC_edh          : CONSTANT character := Eight_bit(240);
   LC_n_tilde      : CONSTANT character := Eight_bit(241);
   LC_o_grave      : CONSTANT character := Eight_bit(242);
   LC_o_acute      : CONSTANT character := Eight_bit(243);
   LC_o_circum     : CONSTANT character := Eight_bit(244);
   LC_o_tilde      : CONSTANT character := Eight_bit(245);
   LC_o_dots       : CONSTANT character := Eight_bit(246);
   Div_sign        : CONSTANT character := Eight_bit(247);
   LC_o_oblique    : CONSTANT character := Eight_bit(248);
   LC_u_grave      : CONSTANT character := Eight_bit(249);
   LC_u_acute      : CONSTANT character := Eight_bit(250);
   LC_u_circum     : CONSTANT character := Eight_bit(251);
   LC_u_dots       : CONSTANT character := Eight_bit(252);
   LC_y_acute      : CONSTANT character := Eight_bit(253);
   LC_thorn        : CONSTANT character := Eight_bit(254);
   LC_y_dots       : CONSTANT character := Eight_bit(255);
END ISO_latin_1;
SHAR_EOF
fi
if test -f 'main.a'
then
	echo shar: "will not over-write existing file 'main.a'"
else
cat << \SHAR_EOF > 'main.a'
----------------------------------------------------------------------
--               Sort package and main program                      --
----------------------------------------------------------------------
-- This file contains a sort package that uses the string-comparison
-- package when sorting and the main program. The sort package is very 
-- simple, it contains just one routine for inserting into the tree 
-- and for writing the tree to standard output.  
PACKAGE Sort_package IS 
   PROCEDURE Insert(Str : IN string); 
   PROCEDURE Write_tree;
END Sort_package;
                  
-- The main program. Reads line from standard input and insert them
-- into the sort package. When end-of-fils is detected, write the
-- tree.
WITH Text_io; 
WITH IO_exceptions;
WITH Sort_package;
WITH Define; USE Define;
WITH Read_command_line;
PROCEDURE Main IS
   Language : Define.Languages := Swedish;
   Eightbit : boolean   := false;
   Exact    : boolean   := false;
   Line     : string(1..80);
   Len      : natural;
BEGIN
   Read_command_line(Language, Exact, Eightbit);
   Define.collatting_sequence(Language, Exact, Eightbit);
   LOOP   
      Text_io.Get_line(Line, Len);
      Sort_package.Insert(Line(1..Len));
   END LOOP;                                    
EXCEPTION
   WHEN IO_exceptions.End_error => Sort_package.Write_tree;
END Main;

-- Below the body of the sort package
WITH Text_io;
WITH String_comparison; USE String_comparison;
PACKAGE BODY Sort_package IS 
   TYPE Tree_entry(Key_size : positive; Str_len  : natural);
   TYPE Tree_type IS ACCESS Tree_entry;
   TYPE Tree_entry(Key_size : positive; Str_len  : natural) IS 
        RECORD          
           Left   : Tree_type := NULL;
           Right  : Tree_type := NULL;
           Key    : Transscripted_string(Key_size);
           Str    : string(1..Str_len);
        END RECORD;
   Tree : Tree_type := NULL;
                
-- Internal recursive insertion procedure. Called by the exported
   PROCEDURE Insert(Tree : IN OUT Tree_type;
                    Key  : IN Transscripted_string;
                    Str  : IN string) IS
   BEGIN
      IF Tree /= NULL THEN
         IF Key < Tree.Key THEN
            Insert(Tree.left, Key, Str);
         ELSIF Key > Tree.Key THEN 
            Insert(Tree.right, Key, Str);
         END IF;
      ELSE
         Tree     := NEW Tree_entry(Key.Max_length, Str'length); 
         Tree.Key := Key;
         Tree.Str := Str;
      END IF;
   END Insert;

-- Exported Insert
   PROCEDURE Insert(Str : IN string) IS
   Transscript : Transscripted_string(Str'length + 20);
   BEGIN
      Transscribe(Str, Transscript);   
      Insert(Tree, Transscript, Str);
   EXCEPTION
      WHEN Transscription_error =>
          Text_io.Put_line(Str);
          Text_io.Put_line("This line has too long transscription. Skipped.");
   END Insert;

-- This procedure travserse the tree and writes all entries on standard output
   PROCEDURE Write_tree(Tree : IN Tree_type) IS
   BEGIN                  
      IF Tree /= NULL THEN
         Write_tree(Tree.Left);
         Text_io.Put_line(Tree.Str);
         Write_tree(Tree.Right);
      END IF;
   END Write_tree;

-- Exported Write_tree;
   PROCEDURE Write_tree IS
   BEGIN
      Write_tree(Tree);
   END;
    
END Sort_package;
SHAR_EOF
fi
if test -f 'natascii.a'
then
	echo shar: "will not over-write existing file 'natascii.a'"
else
cat << \SHAR_EOF > 'natascii.a'
----------------------------------------------------------------------
--                      PACKAGE National ASCII                      --
----------------------------------------------------------------------
-- This package declares alternate names for the ASCII codes
-- 64, 91-94, 96 and 123-126 to be used when when these codes refers 
-- to national characters. The names are restricted to letters. 
-- Languages covered: Swedish/Finnish, Danish/Norwegian, German, 
-- French and Italian.

PACKAGE National_ASCII IS

-- Swedish and Finnish
   SW_UC_E_acute   : CONSTANT character := '@';
   SW_UC_A_ring    : CONSTANT character := ']';
   SW_UC_A_dots    : CONSTANT character := '[';
   SW_UC_O_dots    : CONSTANT character := '\';
   SW_UC_U_dots    : CONSTANT character := '^';
   SW_LC_e_acute   : CONSTANT character := '`';
   SW_LC_a_ring    : CONSTANT character := '}';
   SW_LC_a_dots    : CONSTANT character := '{';
   SW_LC_o_dots    : CONSTANT character := '|';
   SW_LC_u_dots    : CONSTANT character := '~';
                   
-- Danish and Norwegian
   DA_UC_AE        : CONSTANT character := '[';
   DA_UC_O_oblique : CONSTANT character := '\';
   DA_UC_A_ring    : CONSTANT character := ']';
   DA_UC_U_dots    : CONSTANT character := '^';
   DA_LC_ae        : CONSTANT character := '{';
   DA_LC_o_oblique : CONSTANT character := '|';
   DA_LC_a_ring    : CONSTANT character := '}';
   DA_LC_u_dots    : CONSTANT character := '~';
                   
-- German          
   GER_UC_A_dots   : CONSTANT character := '[';
   GER_UC_O_dots   : CONSTANT character := '\';
   GER_UC_U_dots   : CONSTANT character := ']';
   GER_LC_a_dots   : CONSTANT character := '{';
   GER_LC_o_dots   : CONSTANT character := '|';
   GER_LC_u_dots   : CONSTANT character := '}';
   GER_LC_s_sharp  : CONSTANT character := '~';
                   
-- French          
   FR_LC_a_grave   : CONSTANT character := '@';
   FR_LC_c_cedilla : CONSTANT character := '\';
   FR_LC_e_acute   : CONSTANT character := '{';
   FR_LC_u_grave   : CONSTANT character := '|';
   FR_LC_e_grave   : CONSTANT character := '}';
                   
-- Italian         
   IT_LC_A_ring    : CONSTANT character := ']';
   IT_LC_u_grave   : CONSTANT character := '`';
   IT_LC_a_grave   : CONSTANT character := '}';
   IT_LC_o_grave   : CONSTANT character := '{';
   IT_LC_e_grave   : CONSTANT character := '|';
   IT_LC_i_grave   : CONSTANT character := '~';
                   
END National_ASCII;
SHAR_EOF
fi
if test -f 'strcompb.a'
then
	echo shar: "will not over-write existing file 'strcompb.a'"
else
cat << \SHAR_EOF > 'strcompb.a'
----------------------------------------------------------------------
--                    BODY string_comparison                        --
----------------------------------------------------------------------
-- This file contains the implementation part of the string comparison
-- package.

PACKAGE BODY string_comparison IS

--   CONTENTS
--   --------
--      Type declarations and simple functions
--      Internal Load operations
--      Exported load operations
--      Internal routines for transscribing numbers
--      Exported transscription operations
--      Internal comparison procedures
--      Exportered string comparators
                     
-- The transscription table
   -- The translation of a character is a string. This is for characters 
   -- like the AE ligature. Also useful is you want "0" = "zero".
   TYPE Transscript_entry(Length : positive) IS 
       RECORD    
          Alphabetic   : Natural_string(1..Length) := (OTHERS => 0);
          Accent       : Natural_string(1..Length) := (OTHERS => 0);
          Case_variant : boolean := false;
       END RECORD;                        
   TYPE Entry_ptr IS ACCESS Transscript_entry; 
   -- Pointer to allow different sizes 

   -- The index in the table is the ordinal number. Ada's character type is
   -- limited to 127.
   Char_table : ARRAY (0..255) OF Entry_ptr := (OTHERS => NULL);
          

-- Other types
   -- This type is for internal comparison functions
   TYPE Relation_type IS (Less_than, Equal, Greater_than);

   -- Range for the number characters
   SUBTYPE Numbers IS integer RANGE character'pos('0')..character'pos('9');

-- Variables
   -- Case significance
   Case_significant : boolean := true;

   -- Last used codes 
   Last_alpha_code  : integer := 0; 
   Last_accent_code : integer := 0; 
   -- When storing an alphabetic we increment Last_alpha_code, when loading
   -- a accent variant we increment Last_accent_code.

-- Simple functions
   FUNCTION Is_letter(ch : character) RETURN boolean IS
   BEGIN
      RETURN Char_table(character'pos(ch)).Length > 0;
   END;

   -- Set case significance for the double-case load operations
   PROCEDURE Set_case_significance(Flag : boolean) IS
   BEGIN
      Case_significant := Flag;
   END;

-- Internal Load operations
   -- These take integer parametes. The exported routines call these.
   -- We're having integer to avoid problems with characters over 127.
   
   PROCEDURE Load_alphabetic(ch : integer) IS
   -- Load ch in the table as a one without any Accent part. If ch is already
   -- defined, raise Already defined.
   BEGIN
      IF Char_table(ch) /= NULL THEN
         RAISE Already_defined;   
      END IF;                  
      Char_table(ch) := NEW Transscript_entry(1); 
      Last_alpha_code := Last_alpha_code + 1;
      Char_table(ch).Alphabetic(1) := Last_alpha_code;
   END Load_alphabetic;
                                                          
   PROCEDURE Load_variant(ch       : IN integer;
                          Equ_ch   : IN integer;
                          Equ_kind : IN Equivalence_kind) IS
   -- Load ch as an variant of Equ_ch. Equ_ch must be defined or else 
   -- we raise Undefined_equivalent.
   BEGIN
      IF Char_table(ch) /= NULL THEN     
         RAISE Already_defined;   
      END IF;
      IF Char_table(Equ_ch) = NULL THEN
         RAISE Undefined_equivalent;
      END IF;                           
      Char_table(ch) := NEW Transscript_entry(Char_table(Equ_ch).Length); 
      Char_table(ch).Alphabetic   := Char_table(Equ_ch).Alphabetic;
      Char_table(ch).Accent       := Char_table(Equ_ch).Accent;
      Char_table(ch).Case_variant := Char_table(Equ_ch).Case_variant;
      -- Actually: Char_table(ch).all := Char_table(Equ_ch).all;
      -- Alas, Verdix Ada can't handle this properly
      CASE Equ_kind IS
         WHEN Exact     => NULL;                          
         WHEN Case_diff => Char_table(ch).Case_variant := true;
         WHEN Accented  => Last_accent_code         := Last_accent_code + 1;
                           Char_table(ch).Accent(1) := Last_accent_code;
      END CASE;
   END Load_variant;

   PROCEDURE Load_variant(ch      : IN integer;
                          Equ_str : IN Natural_string) IS
   -- Load ch as an accented letter (digraph) of Equ_str. If not all 
   -- characters in Equ_str are deifined, raise Undefined_equivalent.
   BEGIN
      IF Char_table(ch) /= NULL THEN     
         RAISE Already_defined;   
      END IF;
      FOR i IN Equ_str'range LOOP
         IF Char_table(Equ_str(i)) = NULL THEN
            RAISE Undefined_equivalent;
         END IF;
      END LOOP;                      
      Char_table(ch) := NEW Transscript_entry(Equ_str'length);
      FOR i IN Equ_str'range LOOP
         Char_table(ch).Alphabetic(i) := Char_table(Equ_str(i)).Alphabetic(1); 
         Last_accent_code := Last_accent_code + 1;   
         Char_table(ch).Accent(i) := Last_accent_code; 
      END LOOP;                      
   END Load_variant;

-- The exported load operations
   PROCEDURE Load_alphabetic(ch : IN character) IS
   BEGIN
      Load_alphabetic(character'pos(ch));
   END Load_alphabetic;

   PROCEDURE Load_variant(ch       : IN character;  
                          Equ_ch   : IN character;
                          Equ_kind : IN Equivalence_kind) IS
   BEGIN
      Load_variant(character'pos(ch), character'pos(Equ_ch), Equ_kind);
   END Load_variant;

   PROCEDURE Load_variant(ch       : IN character;  
                          Equ_str  : IN string) IS
   Equ_int : Natural_string(Equ_str'range);
   BEGIN
      FOR i IN Equ_str'range LOOP
         Equ_int(i) := character'pos(Equ_str(i));
      END LOOP;
      Load_variant(character'pos(ch), Equ_int);
   END Load_variant;
   

-- Exported double-case load operations. 
   PROCEDURE Alpha_both_cases(ch : IN character) IS
   Int_ch : integer := character'pos(ch);
   BEGIN
      Load_alphabetic(Int_ch);
      IF Case_significant THEN
         Load_variant(Int_ch + 32, Int_ch, Case_diff);
      ELSE
         Load_variant(Int_ch + 32, Int_ch, Exact);
      END IF;
   END Alpha_both_cases;

   PROCEDURE Variant_both_cases(ch     : IN character;
                                Equ_ch : IN character) IS
   Int_ch : integer := character'pos(ch);
   BEGIN                                    
      Load_variant(Int_ch, character'pos(Equ_ch), Accented);
      IF Case_significant THEN
         Load_variant(Int_ch + 32, Int_ch, Case_diff);
      ELSE
         Load_variant(Int_ch + 32, Int_ch, Exact);
      END IF;
   END Variant_both_cases;
                      
   PROCEDURE Variant_both_cases(ch      : IN character;       
                                Equ_str : IN string) IS
   Int_ch : integer := character'pos(ch);
   BEGIN
      Load_variant(ch, Equ_str);
      IF Case_significant THEN
         Load_variant(Int_ch + 32, Int_ch, Case_diff);
      ELSE
         Load_variant(Int_ch + 32, Int_ch, Exact);
      END IF;
   END Variant_both_cases;
                                  
-- Internal procedure for transscribing numbers
   PROCEDURE Get_number(Str    : IN     string;
                        Str_ix : IN OUT integer;
                        Number : OUT    integer) IS
   -- Assume Str(Str_ix) is a number. Read as long there are numbers.
   -- Leave Str_ix at the last number character.
   No_in_str : natural := 0;
   ch        : integer := character'pos(Str(Str_ix));       
   BEGIN
      WHILE ch IN Numbers LOOP 
         No_in_str := 10 * No_in_str + ch - Numbers'first;
         IF Str_ix + 1 IN Str'range THEN
            Str_ix := Str_ix + 1;      
            ch := character'pos(Str(Str_ix));      
         ELSE
            ch := 0;
         END IF;
      END LOOP;
      Number := No_in_str;
   EXCEPTION
      WHEN Numeric_error => RAISE Transscription_error;
   END;            
   
-- Exported transscription operations
   PROCEDURE Transscribe(ch        : IN  character;
                         Trans_str : OUT Transscripted_string) IS
   BEGIN                           
      Transscribe( (1 => ch), Trans_str);
   END Transscribe;

   
   PROCEDURE Transscribe(Str       : IN  string;
                         Trans_str : OUT Transscripted_string) IS
   -- Transscribe Str using the table. If the transscription does  
   -- not fit into the out parameter, raise Transscription_error.
   -- Characters in Str that are not defined are regarded as non-letters.
   -- Non-letters are always stored at the their index in Str. 
   -- Numbers are stored specially.
   ch        : natural;       -- Current character;                  
   Tr_ix     : natural := 0;  -- Index in Trans_str except the non-letter part.
   Str_ix    : integer := Str'first;  -- Index in Str and non-letter part.
   No_in_str : natural;
   BEGIN            
      WHILE Str_ix IN Str'range LOOP
         ch := character'pos(Str(Str_ix));
         IF Char_table(ch) /= NULL THEN
            IF Tr_ix + Char_table(ch).Length > Trans_str.Max_length THEN 
               RAISE Transscription_error;
            END IF;                                                         
            IF ch NOT IN Numbers OR Char_table(ch).Accent(1) /= 0 THEN
               FOR i IN 1..Char_table(ch).Length LOOP
                  Tr_ix := Tr_ix + 1;
                  Trans_str.Alphabetic(Tr_ix) := Char_table(ch).Alphabetic(i);
                  Trans_str.Case_part(Tr_ix)  := Char_table(ch).Case_variant;
                  Trans_str.Accents(Tr_ix)    := Char_table(ch).Accent(i);
               END LOOP;
            ELSE 
               Get_number(Str, Str_ix, No_in_str);
               Tr_ix := Tr_ix + 1;
               Trans_str.Alphabetic(Tr_ix) := 1000 + No_in_str;
            END IF;
         ELSE
            IF Str_ix > Trans_str.Max_length THEN
               RAISE Transscription_error;
            END IF;
            Trans_str.Non_letters(Str_ix) := ch;
            Trans_str.Non_letter_length   := Str_ix;
         END IF;   
         Str_ix := Str_ix + 1;
      END LOOP;
      Trans_str.Length := Tr_ix;
   END Transscribe;

-- Internal comparison routines      

   FUNCTION Relation(Left, Right : Natural_string) RETURN Relation_type IS
   -- This function is more os less obsolete. "<" etc should do the job.
   -- Verdix Ada can't this on integer arrays, unfortunately.
   i   : positive := 1;
   Bug : EXCEPTION; -- Should not occur
   BEGIN
      WHILE (i <= Left'last AND i <= Right'last) AND THEN 
            Left(i) = Right(i) LOOP
         i := i + 1;
      END LOOP;
      IF i > Left'last AND i > Right'last THEN
         RETURN Equal;
      ELSIF i > Left'last THEN
         RETURN Less_than;
      ELSIF i > Right'last THEN
         RETURN Greater_than;
      ELSIF Left(i) < Right(i) THEN
         RETURN Less_than;
      ELSIF Left(i) > Right(i) THEN                            
         RETURN Greater_than;
      ELSE
         RAISE Bug;   -- This should not occur.
      END IF;
   END Relation;


   FUNCTION Relation(Left, Right : Transscripted_string) RETURN Relation_type IS
   -- Compare the parts in order. Continue as long as there is unequallity.
   Rel : Relation_type;
   BEGIN                                                       
      Rel := Relation(Left.Alphabetic(1..Left.Length), 
                      Right.Alphabetic(1..Right.Length));
      IF Rel /= Equal THEN
         RETURN Rel;
      END IF;
      Rel := Relation(Left.Accents(1..Left.Length), 
                      Right.Accents(1..Right.Length));
      IF Rel /= Equal THEN
         RETURN Rel;
      END IF;
      Rel := Relation(Left.Non_letters(1..Left.Non_letter_length), 
                      Right.Non_letters(1..Right.Non_letter_length));
      IF Rel /= Equal THEN
         RETURN Rel;
      END IF;
      IF Left.Case_part(1..Left.Length) < 
         Right.Case_part(1..Right.Length) THEN  
         RETURN Less_than;
      ELSIF Left.Case_part(1..Left.Length) >
            Right.Case_part(1..Right.Length) THEN
         RETURN Greater_than;
      ELSE
         RETURN Equal;
      END IF;
   END Relation;
                     
-- Exported comparison operators
   FUNCTION "<=" (Left, Right : Transscripted_string) RETURN boolean IS
   BEGIN
      RETURN Relation(Left, Right) /= Greater_than;            
   END;
   
   FUNCTION "<"  (Left, Right : Transscripted_string) RETURN boolean IS
   BEGIN
      RETURN Relation(Left, Right) = Less_than;
   END;
   
   FUNCTION ">=" (Left, Right : Transscripted_string) RETURN boolean IS
   BEGIN
      RETURN Relation(Left, Right) /= Less_than;
   END;
   
   FUNCTION ">"  (Left, Right : Transscripted_string) RETURN boolean IS
   BEGIN
      RETURN Relation(Left, Right) = Greater_than;
   END;

END string_comparison;
SHAR_EOF
fi
if test -f 'strcomps.a'
then
	echo shar: "will not over-write existing file 'strcomps.a'"
else
cat << \SHAR_EOF > 'strcomps.a'
----------------------------------------------------------------------
--                 SPECIFCATION String_comparison                   --
----------------------------------------------------------------------
-- This package provides operations for comparing strings according to 
-- a user-defined scheme.
-- The package contains operations for load an internal coding table, 
-- routines for coding strings and for comparing coded strings.
PACKAGE String_comparison IS

   -- Load a character as the next in the primary colltating sequence
   PROCEDURE Load_alphabetic(ch : IN character);
   PROCEDURE Alpha_both_cases(ch : IN character);  
   
   -- Load a variant of a character in the main sequence, on accent
   -- level, on case level or as exactly the same.
   TYPE Equivalence_kind IS (Exact, Case_diff, Accented);   
   PROCEDURE Load_variant(ch       : IN character;  
                          Equ_ch   : IN character;
                          Equ_kind : IN Equivalence_kind);
   -- The three below always load on accent level.
   PROCEDURE Load_variant(ch      : IN character;  
                          Equ_str : IN string);  
   PROCEDURE Variant_both_cases(ch     : IN character;
                                Equ_ch : IN character);
   PROCEDURE Variant_both_cases(ch      : IN character;       
                                Equ_str : IN string);
                         
   -- Exceptions that can be raised by the load operations
   Undefined_equivalent : EXCEPTION;
   Already_defined      : EXCEPTION;

   -- Change case significance when loading both cases. Default is off.
   PROCEDURE Set_case_significance(Flag : boolean);
                                                       
   -- Transscript type and coding operations
   TYPE Transscripted_string(Max_length : natural) IS PRIVATE;
   PROCEDURE Transscribe(ch        : IN character;
                         Trans_str : OUT Transscripted_string);
   PROCEDURE Transscribe(Str       : IN string;
                         Trans_str : OUT Transscripted_string);
   Transscription_error : EXCEPTION;

   -- Comparison operators
   FUNCTION "<=" (Left, Right : Transscripted_string) RETURN boolean;
   FUNCTION "<"  (Left, Right : Transscripted_string) RETURN boolean;
   FUNCTION ">=" (Left, Right : Transscripted_string) RETURN boolean;
   FUNCTION ">"  (Left, Right : Transscripted_string) RETURN boolean;

   -- Others
   FUNCTION Is_letter(ch : character) RETURN boolean;
   
PRIVATE            
   TYPE Natural_string IS ARRAY(integer RANGE <>) OF natural;
   TYPE Boolean_string IS ARRAY(integer RANGE <>) OF boolean;
   TYPE Transscripted_string(Max_length : natural) IS
   RECORD
      Length            : natural := 0;
      Alphabetic        : Natural_string(1..Max_length) := (OTHERS => 0);
      Accents           : Natural_string(1..Max_length) := (OTHERS => 0);
      Case_part         : Boolean_string(1..Max_length) := (OTHERS => false);
      Non_letter_length : natural := 0;
      Non_letters       : Natural_string(1..Max_length) := (OTHERS => 256);
   END RECORD;
END String_comparison;
SHAR_EOF
fi
exit 0
#	End of shell archive