[comp.std.internat] A customizable string-comparison package

sommar@enea.UUCP (10/31/87)

<Note that this is cross-posted. Please submit follow-ups to 
 comp.std.internat only>
 
Some months ago I asked for a new character concept where
the result of string comparisons should depend on the
selected languages. I stated that the old ASCII concept,
one character <=> one collation value, must go away.
  I have completed a string-comparison package that performs
much of what I asked for. (I still think that the OS and
the compiler should do this for me.) I have posted the package
itself to comp.sources.misc, and I have included parts of the
READ ME file below. Besides the package itself, the posting
contains definition files for names of non-ASCII characters in
ISO/Latin-1 and national 7-bit ASCII replacements. There is also
a little demonstration program, which reads lines from standard input
and writes them sorted to standard output.
  The package is written in Ada, so if you don't have an Ada
compiler, its practical use may be limited. However, the
main purpose is rather to demonstrate the idea itself.

----------------------------------------------------------------------
Four-dimensional sorting
------------------------
       
The string-comparison package compares strings at four levels:
1) Alphabetic
2) Accents
3) Non-letters
4) Difference in case 
What counts as an alphabetic, an accent and so on is up to the user, who
may define "$" as a letter with "(" as its lowercase variant if desired.

A level is only considered if the levels above it show no difference.
As an example, take
      T^ete-`a-t^ete
(I assume a "normal" loading of the character table here.)
  For the first level we use TETEATETE; that is, we remove the accents
and the hyphens. On the next level we re-insert the accents, so we get
      T^ETE`AT^ETE
On level three we consider only the hyphens. When comparing
non-letters the package uses the plain ASCII values. The earlier
such a character occurs in the string, the lower the sort value. Thus,
"trans-scription" will precede "transscrip-tion". (Actually, as the
package is implemented, the position is more important than the ASCII
value.)
  On the last level we use
    T^ete`at^ete
that is, the original spelling with the hyphens removed. Note that the
user can specify case to be insignificant.
  (This isn't a description of how the package is implemented, just
a way of illustrating the result. In practice it's done a little
more efficiently.)
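
The four levels can be illustrated with a short sketch. It is written in
Python rather than the package's Ada, and it is not how the package works
internally. It assumes Unicode input (so "Tête-à-tête" rather than the 7-bit
T^ete notation), uses Unicode decomposition to split letters from accents,
and assumes that uppercase sorts before lowercase at level 4. The name
four_level_key is invented for this illustration.

```python
import unicodedata

def four_level_key(text, case_significant=True):
    """Return (letters, accents, non-letters, case) for tuple comparison."""
    letters, accents, others, case = [], [], [], []
    # Compose first so that each accented letter is a single character.
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        base = decomposed[0]
        if not base.isalpha():
            # Level 3: position among the letters first, then character code.
            others.append((len(letters), ord(ch)))
            continue
        letters.append(base.upper())      # level 1: accents and case stripped
        accents.append(decomposed[1:])    # level 2: combining marks, if any
        case.append(base.islower())       # level 4: assumed upper before lower
    if not case_significant:
        case = []
    return ("".join(letters), tuple(accents), tuple(others), tuple(case))

words = ["tête", "Tete", "Tête", "tete"]
ordered = sorted(words, key=four_level_key)
# ordered == ["Tete", "tete", "Tête", "tête"]
```

Because Python compares tuples element by element, a later level is only
consulted when all earlier levels are equal, which matches the rule above.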

When defining accented variants it is possible to let a character
be a variant of a string; in this way the AE ligature can be sorted
as "AE". The opposite is not possible and, what is worse, a string
can't have an alphabetic value. Thus the package is not able to sort
languages such as Spanish (CH and LL) correctly.

The digit characters are handled in a special way if you define them
as alphabetics. A sequence of digits is read as one number and sorts
after all other alphabetics (even if the digits were defined as the
first characters). So you will get
   File1   File2   File10   File11
instead of the usual
   File1   File10  File11   File2
  If you would rather sort them as they are read, this is also possible,
e.g. by loading "0" as a variant of "zero".
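
The numeric behaviour can be sketched as follows. This is a Python
illustration under simplifying assumptions, not the package's code: the
name numeric_key is made up, only the ASCII digits 0-9 are treated as
digits, and every other character is simply compared by its uppercase form.

```python
import re

DIGITS = "0123456789"

def numeric_key(name):
    """One key element per character, except that a digit run is one number."""
    key = []
    for token in re.findall(r"[0-9]+|[^0-9]", name):
        if token[0] in DIGITS:
            key.append((1, int(token)))       # whole digit run, after letters
        else:
            key.append((0, token.upper()))    # any other single character
    return key

names = ["File1", "File10", "File11", "File2"]
ordered = sorted(names, key=numeric_key)
# ordered == ["File1", "File2", "File10", "File11"]
```

The leading 0 or 1 in each element is what makes a number sort after any
letter in the same position, so "Filea" comes before "File1" here.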

The package contains the following routines:

Load Operations
---------------
PROCEDURE Load_alphabetic(ch : IN character);
Loads ch as the next alphabetic character. The order of loading
determines the sorting values.

TYPE Equivalence_kind IS (Exact, Case_diff, Accented);
PROCEDURE Load_variant(ch       : IN character;
                       Equ_ch   : IN character;
                       Equ_kind : IN Equivalence_kind);
PROCEDURE Load_variant(ch      : IN character;
                       Equ_str : IN string);
Load_variant loads ch as a variant of Equ_ch or Equ_str. The interpretation
of Equ_kind is:
Exact: Exactly the same. There is no difference. What you use when you
       don't want case to be significant.
Case_diff: Load ch as a lowercase variant of Equ_ch. There will be
           difference at level 4.
Accented:  Load ch as a variant of Equ_ch at level 2.
The second form of Load_variant (with Equ_str) always loads ch at level 2.

To simplify loading, the package also provides routines for loading
a character and its ASCII lowercase equivalent simultaneously:
PROCEDURE Set_case_significance(Flag : boolean);
PROCEDURE Alpha_both_cases(ch : IN character);  
PROCEDURE Variant_both_cases(ch     : IN character;
                             Equ_ch : IN character);
PROCEDURE Variant_both_cases(ch      : IN character;       
                             Equ_str : IN string);
With Set_case_significance you determine whether case should be
significant when loading the pairs. Variant_both_cases loads ch
at level 2.

The loading operations raise Already_defined if an attempt is
made to load a character twice. If Equ_ch or part of Equ_str is
undefined, they raise Undefined_equivalent.
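
To make the loading rules concrete, here is a rough Python model of such a
table. It is only a sketch and does not reproduce the Ada package: the class
CollationTable, its fields and the string values used for the kind argument
are assumptions made for this example. Only the two exception names and the
three equivalence kinds come from the description above.

```python
class AlreadyDefined(Exception):
    pass

class UndefinedEquivalent(Exception):
    pass

class CollationTable:
    """Maps each loaded character to (alphabetic ranks, accent mark, case mark)."""

    def __init__(self):
        self.entries = {}
        self.next_rank = 0      # the loading order decides the sort value
        self.next_accent = 1

    def _require_new(self, ch):
        if ch in self.entries:
            raise AlreadyDefined(ch)

    def load_alphabetic(self, ch):
        self._require_new(ch)
        self.entries[ch] = ((self.next_rank,), 0, 0)
        self.next_rank += 1

    def load_variant(self, ch, equivalent, kind="accented"):
        # equivalent may be one character or a string such as "AE".
        self._require_new(ch)
        if not equivalent:
            raise UndefinedEquivalent(equivalent)
        ranks = []
        for eq in equivalent:
            if eq not in self.entries:
                raise UndefinedEquivalent(eq)
            ranks.extend(self.entries[eq][0])
        _, accent, case = self.entries[equivalent[0]]
        if kind == "case_diff":
            case = 1                      # differs only at level 4
        elif kind == "accented":
            accent = self.next_accent     # differs at level 2
            self.next_accent += 1
        elif kind != "exact":
            raise ValueError(kind)
        self.entries[ch] = (tuple(ranks), accent, case)

table = CollationTable()
for letter in "AE":
    table.load_alphabetic(letter)
table.load_variant("a", "A", "case_diff")
table.load_variant("Æ", "AE")             # a character as a variant of a string
```

After these calls the ligature carries the alphabetic ranks of both "A" and
"E", which is what lets it sort as "AE" at the first level.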

Transcription operations
------------------------
These routines translate a string to the internal coding.
TYPE Transscripted_string(Max_length : natural) IS PRIVATE;
PROCEDURE Transscribe(ch        : IN character;
                      Trans_str : OUT Transscripted_string);
PROCEDURE Transscribe(Str       : IN string;
                      Trans_str : OUT Transscripted_string);
If the transcription is too long, the routines will raise
Transscription_error.
                      
Comparison operators:
---------------------
FUNCTION "<=" (Left, Right : Transscripted_string) RETURN boolean;
FUNCTION "<"  (Left, Right : Transscripted_string) RETURN boolean;
FUNCTION ">=" (Left, Right : Transscripted_string) RETURN boolean;
FUNCTION ">"  (Left, Right : Transscripted_string) RETURN boolean;

I have only included operations for comparing transcribed
strings. Of course there could be a set for uncoded strings too.

Other function
--------------
FUNCTION Is_letter(ch : character) RETURN boolean;

The demonstration program
-------------------------
The program takes the options:
-8  Use ISO/Latin-1. If not present, use 7-bit ASCII with national
    replacements.
-e  Case is significant. When omitted, case is not significant.
-LX Selects language. X should be one of the following:
    s or S: Swedish. (Default)
    d or D: Danish
    g:      German1: "A, "O and "U sort as A, O and U.
    G:      German2: "A, "O and "U sort as AE, OE and UE.
    f or F: French
   
In the definition routine I load space as the first alphabetic
letter. This gives the result that "Smith, Tony" will sort
before "Smithson, Alan".
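
The effect of loading space as an alphabetic can be seen in a small Python
sketch of the first level only. It is an illustration, not the package's
code: level1_key and LETTERS are names made up here, and anything not in
the given alphabet is simply dropped, as a non-letter would be at level 1.

```python
def level1_key(text, alphabet):
    """Keep only characters found in alphabet and map each to its rank."""
    rank = {ch: i for i, ch in enumerate(alphabet)}
    return [rank[ch] for ch in text.upper() if ch in rank]

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
names = ["Smithson, Alan", "Smith, Tony"]

# Space is a non-letter: SMITHTONY is compared with SMITHSONALAN.
without_space = sorted(names, key=lambda n: level1_key(n, LETTERS))
# Space is the first alphabetic: "SMITH TONY" stops at the space.
with_space = sorted(names, key=lambda n: level1_key(n, " " + LETTERS))
# without_space == ["Smithson, Alan", "Smith, Tony"]
# with_space    == ["Smith, Tony", "Smithson, Alan"]
```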
-- 
Erland Sommarskog       
ENEA Data, Stockholm    
sommar@enea.UUCP        
                    It could have been worse; it could have been Pepsi.

irf@kuling.UUCP (10/31/87)

In article <2428@enea.UUCP> sommar@enea.UUCP(Erland Sommarskog) writes:
>
><Note that this is cross-posted. Please submit follow-ups to 
> comp.std.internat only>
> 
>Some months ago I asked for a new character concept where
>the result of string comparisons should depend on the 
>selected languages. I stated that the old ASCII concept 
>one character <=> one collation value must go away. 
>  I have completed a string-comparison package, that performs
>much of what I asked for. (I still think that the OS and
>( the rest has been deleted ...)

If I understand correctly, it seems to me you're trying to reinvent
the wheel. Isn't this what NLS (HP-UX Native Language System, now accepted
as a standard by X/OPEN) is doing, plus a host of other nifty things like
automatically changing from the Anglo-Saxon 12-hour clock to the 24-hour
one, issuing error messages in your own native language instead of English,
taking care of collating sequences (e.g., treating 'll' and 'ch' in a special
way in Spanish) and so forth?


::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
Bo Thide', Swedish Institute of Space Physics.  UUCP: ...enea!kuling!irfu!bt
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

greger@ism780b (Greger Leijonhufvud) (11/05/87)

In article <2428@enea.UUCP> sommar@enea.UUCP(Erland Sommarskog) writes:
>
><Note that this is cross-posted. Please submit follow-ups to 
> comp.std.internat only>
> 
>Some months ago I asked for a new character concept where
>the result of string comparisons should depend on the 
>selected languages. I stated that the old ASCII concept 
>one character <=> one collation value must go away. 
 

I hope Erland and others are aware of the work done in the ANSI X3J11 and POSIX
organizations. The problem with string comparison was extensively
discussed during the "final" phase of X3J11 and especially in
the Internationalization "subcommittee".

The proposal (which, hopefully, will become a full standard) identifies
two specific new library functions intended to provide support for
collation which is not dependent on the physical encoding. Both
are dependent on some (user-selectable) external information on
the desired collation sequence.

The two functions are: strcoll(3C) and strxfrm(3C). They differ
in that strcoll performs a compare of two items (as strcmp) according
to the desired collation order, while strxfrm transforms the string
according to the external information such that a subsequent strcmp
using the "native" collation can be performed.

Strcoll is useful for occasional compares, while strxfrm is intended
for repeated compares, as in a sort (the table-driven compares are
quite slow compared to the native compare, so a pre-transformation
before the actual sorting is quite advantageous).
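
The difference between the two calls can be shown with Python's standard
locale module, which wraps both C functions. This is only a sketch: the
word list is made up, and the "C" locale is selected explicitly so that the
result is reproducible. Passing "" instead would pick up the collation
order from the user's environment, which is the user-selectable information
mentioned above.

```python
import locale
from functools import cmp_to_key

# "C" keeps the example reproducible; "" would use the user's environment.
locale.setlocale(locale.LC_COLLATE, "C")

words = ["pear", "Apple", "apple", "Pear"]

# Occasional compares: strcoll is called for every pair the sort looks at.
by_strcoll = sorted(words, key=cmp_to_key(locale.strcoll))

# Repeated compares: each string is transformed once with strxfrm, and the
# sort then compares the transformed keys with the native comparison.
by_strxfrm = sorted(words, key=locale.strxfrm)
```

Both calls give the same order. The second does the table-driven work once
per string instead of once per comparison.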

Strcoll is also supported in the X/OPEN specifications, as nl_strcmp
and nl_strncmp.

Recently, the /usr/group Internationalization committee has made some
proposals to POSIX P1003.2 (commands & utilities) in the area of
regular expressions which draw heavily on these facilities.

In all cases, the collation order allows the "user" (actually, this
is more of an administrator type of job) to specify a collation
order in which the ordering is independent of character values.
In addition, the user can specify

1.  that a string of characters sorts as one (example: Spanish ch and
    ll),

2.  that one character sorts as a string (example: German double s),

3.  that several characters can have the same collation order (example:
    accented e's sort with unaccented e),

4.  that, if two strings containing such "equivalent" characters
    collate equal, then the order between them depends on a "secondary"
    collation value.

5.  that characters can be designated as "don't care", i.e. are
    disregarded when comparing.

As can be seen, this does change the collation from character-oriented
to string-oriented.
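
The five points above can be sketched in Python as a table of collating
elements. This is only an illustration of the idea, not the format used by
X3J11, POSIX or any real system: make_key, ORDER and DEMO are invented
names, and the weights are arbitrary values chosen for the example.

```python
def make_key(text, elements, ignore=""):
    """elements maps a collating element (one or more characters) to a
    list of (primary, secondary) weights."""
    longest = max(len(e) for e in elements)
    primary, secondary = [], []
    i = 0
    while i < len(text):
        if text[i] in ignore:                 # point 5: "don't care" characters
            i += 1
            continue
        for size in range(longest, 0, -1):    # longest match, so "ch" beats "c"
            chunk = text[i:i + size]
            if chunk in elements:
                for p, s in elements[chunk]:
                    primary.append(p)
                    secondary.append(s)
                i += len(chunk)
                break
        else:
            raise KeyError(text[i])
    return (primary, secondary)               # point 4: secondary breaks ties

# Point 1: "ch" and "ll" are elements of their own, after "c" and "l".
ORDER = list("abc") + ["ch"] + list("defghijkl") + ["ll"] + list("mnopqrstuvwxyz")
DEMO = {element: [(weight, 0)] for weight, element in enumerate(ORDER)}
e_weight = DEMO["e"][0][0]
s_weight = DEMO["s"][0][0]
DEMO["\u00e9"] = [(e_weight, 1)]                   # point 3: e-acute sorts with e
DEMO["\u00df"] = [(s_weight, 0), (s_weight, 1)]    # point 2: sharp s sorts as "ss"

words = ["chico", "cubo", "dama", "curva"]
ordered = sorted(words, key=lambda w: make_key(w, DEMO, ignore="-"))
# ordered == ["cubo", "curva", "chico", "dama"]
```

A plain character-value sort would put "chico" first. Here it follows every
word that starts with a plain "c", because "ch" is compared as one unit.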

And finally, there are several UNIX systems on the market (notably,
the X/OPEN ones, including HP's, and one from IBM) which do provide
this functionality.

If there is interest, I am more than happy to post more elaborate
descriptions of these things to the net.
 ------
Greger Leijonhufvud
INTERACTIVE Systems Corporation
Santa Monica, CA. 90404

"The above views does not represent
 anything but mine own..."
 				Reverse the polarity of the neutron flow!