[sci.lang] A customizable string-comparison package

sommar@enea.UUCP (10/31/87)

<Note that this is cross-posted. Please submit follow-ups to 
 comp.std.internat only>
 
Some months ago I asked for a new character concept in which
the result of string comparisons depends on the selected
language. I stated that the old ASCII concept of one
character <=> one collation value must go away.
  I have completed a string-comparison package that performs
much of what I asked for. (I still think that the OS and
the compiler should do this for me.) I have posted the package
itself to comp.sources.misc, and I have included parts of the
READ ME file below. Besides the package itself, the posting
contains definition files for names of non-ASCII characters in
ISO/Latin-1 and national 7-bit ASCII replacements. There is also
a little demonstration program that reads lines from standard input
and writes them sorted to standard output.
  The package is written in Ada, so if you don't have an Ada
compiler, it may be of little practical use to you. However, the
main purpose is rather to demonstrate the idea itself.

----------------------------------------------------------------------
Four-dimensional sorting
------------------------
       
The string-comparison package compares strings at four levels:
1) Alphabetic
2) Accents
3) Non-letters
4) Difference in case 
What counts as an alphabetic character, an accent and so on is up to
the user. He may define "$" as a letter with "(" as its lowercase
variant if he likes.

A level is considered only if all the levels above it show no difference.
As an example I take 
      T^ete-`a-t^ete
(I assume a "normal" loading of the character table here.)
  For the first level we use TETEATETE; that is, we remove the accents
and the hyphens. On the next level we re-insert the accents, so we get
      T^ETE`AT^ETE
On level three we consider only the hyphens. When comparing
non-letters the package uses the plain ASCII values. The earlier
a non-letter occurs in the string, the lower its sort value. Thus,
"trans-scription" will precede "transscrip-tion". (Actually, as the
implementation is done, the position is more important than the
ASCII value.)
  On the last level we use
    T^ete`at^ete
that is, the original spelling with the hyphens removed. Note that the
user can specify case to be insignificant.
  (This isn't a description of how the package is implemented, just
a way of illustrating the result. In practice it is done a little
more efficiently.)
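
As an illustration of the four levels, here is a minimal sketch in
Python (not the Ada package itself). It assumes the notation used in
the example above, where "^" and "`" are accent marks written before
the letter they belong to. The function name four_keys, the rule that
unaccented sorts before accented, and the rule that uppercase sorts
before lowercase are assumptions made for the sketch only.

```python
ACCENTS = {"^", "`"}   # accent marks, written before the letter (assumed notation)

def four_keys(text):
    """Return the four comparison keys, most significant level first."""
    base, accents, non_letters, case = [], [], [], []
    pending = ""                        # accent waiting for its letter
    for ch in text:
        if ch in ACCENTS:
            pending = ch
        elif ch.isalpha():
            base.append(ch.upper())     # level 1: bare letters, case folded
            accents.append(pending)     # level 2: one accent slot per letter
            case.append(ch.islower())   # level 4: uppercase (False) sorts first
            pending = ""
        else:
            # level 3: position among the letters first, then the ASCII value
            non_letters.append((len(base), ord(ch)))
    return ("".join(base), tuple(accents), tuple(non_letters), tuple(case))

words = ["transscrip-tion", "trans-scription", "tete", "T^ete", "Tete"]
for w in sorted(words, key=four_keys):
    print(w)
```

Because Python compares tuples element by element, a lower level is
consulted only when all the levels above it are equal, which is the
rule stated above.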

When defining accented variants it is possible to let a character
be a variant of a string; in this way the AE ligature can be sorted
as "AE". The opposite is not possible, and what is worse, a string
can't have an alphabetic value. Thus the package is not able to sort
languages such as Spanish (CH and LL) correctly.
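
The limitation can be sketched as follows, in Python and with a
hypothetical table called LEVEL1 that maps each character to the
alphabetic string it sorts as. Only the alphabetic level is shown.
Since the lookup is done one input character at a time, a character
can expand to a string, but a two-character sequence such as CH can
never receive a value of its own.

```python
import string

# Hypothetical level-1 table: character -> the alphabetic string it sorts as.
LEVEL1 = {c: c for c in string.ascii_uppercase}
LEVEL1["Æ"] = "AE"      # one character loaded as a variant of a string

def level1_key(word):
    # Strictly per-character lookup: "Æ" expands to "AE", but nothing
    # here can treat the pair "C" + "H" as a single letter.
    return "".join(LEVEL1.get(ch, "") for ch in word.upper())

print(sorted(["Caf", "Cæsar", "Cad"], key=level1_key))
print(sorted(["cuna", "chico"], key=level1_key))
```

The second call puts "chico" before "cuna", whereas the traditional
Spanish ordering, with CH as a letter of its own after C, would put
it last.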

Digits are handled in a special way if you define them
as alphabetics. A sequence of digits is read as one number and sorts
after all other alphabetics (even if the digits were loaded as the
first characters). So you will get
   File1   File2   File10   File11
instead of the usual
   File1   File10  File11   File2
  If you would like to sort them as they are read, this is also possible,
e.g. load "0" as a variant of "zero".
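
The treatment of digit sequences can be sketched in Python like this.
The function name digit_run_key is made up for the illustration, and
plain uppercase ASCII order stands in for the loaded alphabet.

```python
import re

def digit_run_key(name):
    """Key in which a run of digits counts as one number that sorts
    after every other alphabetic character."""
    parts = []
    for run in re.findall(r"[0-9]+|[^0-9]", name):
        if run[0] in "0123456789":
            parts.append((1, int(run)))      # whole run read as one number
        else:
            parts.append((0, run.upper()))   # a single ordinary character
    return parts

files = ["File10", "File2", "File11", "File1"]
print(sorted(files, key=digit_run_key))   # numeric order
print(sorted(files))                      # plain character-by-character order
```

The leading 0 or 1 in each pair is what places numbers after all
other characters, whatever their numeric value.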

The package contains the following routines:

Load Operations
---------------
PROCEDURE Load_alphabetic(ch : IN character);
Loads ch as the next alphabetic character. The order of loading
determines the sorting values.

TYPE Equivalence_kind IS (Exact, Case_diff, Accented);
PROCEDURE Load_variant(ch       : IN character;
                       Equ_ch   : IN character;
                       Equ_kind : IN Equivalence_kind);
PROCEDURE Load_variant(ch      : IN character;
                       Equ_str : IN string);
Load_variant loads ch as a variant of Equ_ch or Equ_str. The interpretation
of Equ_kind is:
Exact:     Exactly the same; there is no difference. Use this when you
           don't want case to be significant.
Case_diff: Load ch as a lowercase variant of Equ_ch. There will be a
           difference at level 4.
Accented:  Load ch as a variant of Equ_ch at level 2.
The second version of Load_variant (the one taking Equ_str) always loads
ch at level 2.

To simplify loading, the package also provides routines for loading
a character and its ASCII lowercase equivalent simultaneously:
PROCEDURE Set_case_significance(Flag : boolean);
PROCEDURE Alpha_both_cases(ch : IN character);  
PROCEDURE Variant_both_cases(ch     : IN character;
                             Equ_ch : IN character);
PROCEDURE Variant_both_cases(ch      : IN character;       
                             Equ_str : IN string);
With Set_case_significance you determine whether case should be
significant when loading the pairs. Variant_both_cases loads ch
at level 2.

The loading operations raise Already_defined if an attempt is
made to load a character twice. If Equ_ch or part of Equ_str is
undefined, the exception Undefined_equivalent is raised.
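
The loading rules can be modelled roughly as below. This is a Python
sketch, not the package's code: the class CollationTable and its
internal layout are assumptions. Only the two exception conditions
and the rule that loading order determines the sorting value come
from the description above.

```python
class AlreadyDefined(Exception):
    pass

class UndefinedEquivalent(Exception):
    pass

class CollationTable:
    def __init__(self):
        self.entries = {}      # character -> (alphabetic values, kind)
        self.next_value = 0

    def load_alphabetic(self, ch):
        if ch in self.entries:
            raise AlreadyDefined(ch)
        # The order of loading determines the sorting value.
        self.entries[ch] = ((self.next_value,), "alphabetic")
        self.next_value += 1

    def load_variant(self, ch, equ, kind="accented"):
        """equ is a single character or a string of loaded characters."""
        if ch in self.entries:
            raise AlreadyDefined(ch)
        values = []
        for e in equ:
            if e not in self.entries:
                raise UndefinedEquivalent(e)
            values.extend(self.entries[e][0])
        self.entries[ch] = (tuple(values), kind)

table = CollationTable()
for letter in "ABCDE":
    table.load_alphabetic(letter)
table.load_variant("a", "A", kind="case_diff")
table.load_variant("Æ", "AE")       # a character as a variant of a string
print(table.entries["Æ"])
```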

Transcription operations
------------------------
These routines translate a string to the internal coding.
TYPE Transscripted_string(Max_length : natural) IS PRIVATE;
PROCEDURE Transscribe(ch        : IN character;
                      Trans_str : OUT Transscripted_string);
PROCEDURE Transscribe(Str       : IN string;
                      Trans_str : OUT Transscripted_string);
If the transcription is too long, the routines raise
Transscription_error.
                      
Comparison operators
--------------------
FUNCTION "<=" (Left, Right : Transscripted_string) RETURN boolean;
FUNCTION "<"  (Left, Right : Transscripted_string) RETURN boolean;
FUNCTION ">=" (Left, Right : Transscripted_string) RETURN boolean;
FUNCTION ">"  (Left, Right : Transscripted_string) RETURN boolean;

I have only included operators for comparing transcribed
strings. Of course there could be a set for uncoded strings too.
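
The idea of coding a string once and then comparing the coded form
can be sketched in Python as follows. The class Transcribed and its
trivial internal coding (uppercased letters only) are stand-ins
invented for the illustration. The real internal coding is not
described in this posting.

```python
from functools import total_ordering

class TranscriptionError(Exception):
    pass

@total_ordering
class Transcribed:
    """A string translated once to a comparable internal form."""

    def __init__(self, text, max_length):
        # Stand-in coding: letters only, case folded.
        key = tuple(ord(c) for c in text.upper() if c.isalpha())
        if len(key) > max_length:
            raise TranscriptionError(text)
        self.key = key

    def __eq__(self, other):
        return self.key == other.key

    def __lt__(self, other):
        return self.key < other.key

a = Transcribed("Hello", 10)
b = Transcribed("help", 10)
print(a < b, a <= b, a > b, a >= b)
```

total_ordering derives "<=", ">" and ">=" from "<" and "==", so the
four operators listed above are all available on the coded form.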

Other function
--------------
FUNCTION Is_letter(ch : character) RETURN boolean;

The demonstration program
-------------------------
The program takes the options:
-8  Use ISO/Latin-1. If not present, use 7-bit ASCII with national
    replacements.
-e  Case is significant. When omitted, case is not significant.
-LX Selects language. X should be one of the following:
    s or S: Swedish. (Default)
    d or D: Danish
    g:      German1: "A, "O and "U sort as A, O and U.
    G:      German2: "A, "O and "U sort as AE, OE and UE.
    f or F: French
   
In the definition routine I load space as the first alphabetic
character. As a result, "Smith, Tony" sorts before
"Smithson, Alan".
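
The effect of loading space first can be checked with a small Python
sketch of the alphabetic level alone. The helper make_level1_key is
hypothetical. Characters that are not in the loaded alphabet, here
the comma, are treated as non-letters and skipped at this level.

```python
import string

def make_level1_key(alphabet):
    """Build a level-1 key function from the characters in loading order."""
    rank = {ch: i for i, ch in enumerate(alphabet)}
    def key(name):
        # Characters missing from the table are non-letters: skip them.
        return [rank[ch] for ch in name.upper() if ch in rank]
    return key

with_space = make_level1_key(" " + string.ascii_uppercase)   # space loaded first
letters_only = make_level1_key(string.ascii_uppercase)       # space is a non-letter

names = ["Smithson, Alan", "Smith, Tony"]
print(sorted(names, key=with_space))
print(sorted(names, key=letters_only))
```

With space as a non-letter the keys are SMITHSONALAN and SMITHTONY,
so "Smithson, Alan" would come first. With space as the lowest
alphabetic, "Smith" followed by a space sorts ahead.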
-- 
Erland Sommarskog       
ENEA Data, Stockholm    
sommar@enea.UUCP        
                    It could have been worse; it could have been Pepsi.