sommar@enea.UUCP (10/31/87)
<Note that this is cross-posted. Please submit follow-ups to
comp.std.internat only>

Some months ago I asked for a new character concept where the result
of string comparisons should depend on the selected languages. I stated
that the old ASCII concept, one character <=> one collation value, must
go away.

I have completed a string-comparison package that performs much of
what I asked for. (I still think that the OS and the compiler should do
this for me.) I have posted the package itself to comp.sources.misc,
and I have included parts of the README file below.

Besides the package itself, the posting contains definition files for
names of non-ASCII characters in ISO/Latin-1 and national 7-bit ASCII
replacements. There is also a little demonstration program that reads
lines from standard input and writes them sorted to standard output.

The package is written in Ada, so if you don't have an Ada compiler its
practical use may be limited; the main purpose, however, is to
demonstrate the idea itself.

----------------------------------------------------------------------
Four-dimensional sorting
------------------------
The string-comparison package compares strings at four levels:
   1) Alphabetic
   2) Accents
   3) Non-letters
   4) Difference in case
What counts as an alphabetic character, and so on, is up to the user.
He may define "$" as a letter with "(" as its lowercase variant if he
likes.

A level is considered only if the levels above it show no difference.
As an example I take
   T^ete-`a-t^ete
(I assume a "normal" loading of the character table here.)
For the first level we use
   TETEATETE
that is, we remove the accents and the hyphens. On the next level we
re-insert the accents, so we get
   T^ETE`AT^ETE
On level three we consider only the hyphens. When comparing
non-letters the package uses the plain ASCII values. The earlier a
non-letter occurs in the string, the lower its sort value. Thus,
"trans-scription" will precede "transscrip-tion".
(Actually, as implemented, the position is more important than the
ASCII value.)
On the last level we use
   T^ete`at^ete
that is, the original spelling with the hyphens removed. Note that the
user can specify case to be insignificant.
(This isn't a description of how the package is implemented, just a way
of illustrating the result. In practice it is done a little more
efficiently.)

When defining accented variants it is possible to let a character be a
variant of a string; in this way the AE ligature can be sorted as "AE".
The opposite is not possible and, what is worse, a string can't have an
alphabetic value. Thus the package is not able to sort languages such
as Spanish (CH and LL) correctly.

The digits are handled in a special way if you define them as
alphabetics. A sequence of digits is read as one number and sorts after
all other alphabetics (even if the digits were defined as the first
characters). So you will get
   File1  File2  File10  File11
instead of the usual
   File1  File10  File11  File2
If you would rather sort them as they are read, this is also possible,
e.g. by loading "0" as a variant of "zero".

The package contains the following routines:

Load Operations
---------------
   PROCEDURE Load_alphabetic(ch : IN character);
Loads ch as the next alphabetic character. The order of loading
determines the sorting values.

   PROCEDURE Load_variant(ch       : IN character;
                          Equ_ch   : IN character;
                          Equ_kind : IN Equivalence_kind);
   TYPE Equivalence_kind IS (Exact, Case_diff, Accented);
   PROCEDURE Load_variant(ch      : IN character;
                          Equ_str : IN string);
Load_variant loads ch as a variant of Equ_ch or Equ_str. The
interpretation of Equ_kind is:
   Exact:     Exactly the same; there is no difference. Use this when
              you don't want case to be significant.
   Case_diff: Load ch as a lowercase variant of Equ_ch. There will be
              a difference at level 4.
   Accented:  Load ch as a variant of Equ_ch at level 2.
The latter version of Load_variant always loads ch at level 2.

To simplify loading, the package also provides routines for loading a
character and its ASCII lowercase equivalent simultaneously:
   PROCEDURE Set_case_significance(Flag : boolean);
   PROCEDURE Alpha_both_cases(ch : IN character);
   PROCEDURE Variant_both_cases(ch     : IN character;
                                Equ_ch : IN character);
   PROCEDURE Variant_both_cases(ch      : IN character;
                                Equ_str : IN string);
With Set_case_significance you determine whether case should be
significant when loading the pairs. Variant_both_cases loads ch at
level 2.

The loading operations raise Already_defined if an attempt is made to
load a character twice. If Equ_ch or part of Equ_str is undefined,
Undefined_equivalent is raised.

Transcription operations
------------------------
These routines translate a string to the internal coding.
   TYPE Transscripted_string(Max_length : natural) IS PRIVATE;
   PROCEDURE Transscribe(ch        : IN character;
                         Trans_str : OUT Transscripted_string);
   PROCEDURE Transscribe(Str       : IN string;
                         Trans_str : OUT Transscripted_string);
If the transcription is too long, the routines raise
Transscription_error.

Comparison operators
--------------------
   FUNCTION "<=" (Left, Right : Transscripted_string) RETURN boolean;
   FUNCTION "<"  (Left, Right : Transscripted_string) RETURN boolean;
   FUNCTION ">=" (Left, Right : Transscripted_string) RETURN boolean;
   FUNCTION ">"  (Left, Right : Transscripted_string) RETURN boolean;
I have only included operations for comparing transcribed strings. Of
course there could be a set for uncoded strings too.

Other function
--------------
   FUNCTION Is_letter(ch : character) RETURN boolean;

The demonstration program
-------------------------
The program takes the options:
   -8   Use ISO/Latin-1. If not present, use 7-bit ASCII with national
        replacements.
   -e   Case is significant. When omitted, case is not significant.
   -LX  Selects language. X should be one of the following:
        s or S: Swedish. (Default)
        d or D: Danish
        g: German1: "A, "O and "U sort as A, O and U.
        G: German2: "A, "O and "U sort as AE, OE and UE.
        f or F: French
In the definition routine I load space as the first alphabetic letter.
As a result, "Smith, Tony" sorts before "Smithson, Alan".
--
Erland Sommarskog
ENEA Data, Stockholm
sommar@enea.UUCP
   It could have been worse; it could have been Pepsi.
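
The four-level comparison described in the post above can be sketched
roughly as follows. This is not the Ada package: it is a minimal Python
illustration in which the name four_level_key is hypothetical, Unicode
decomposition stands in for the loaded variant tables, plain code-point
order stands in for the loaded alphabet, and uppercase is assumed to
sort before lowercase at level 4 (the post does not say which comes
first).

```python
import unicodedata


def four_level_key(text):
    """Build a sort key: letters, then accents, then non-letters, then case."""
    letters, accents, non_letters, cases = [], [], [], []
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        if base.isalpha():
            letters.append(base.lower())   # level 1: bare letter, case folded
            accents.append(marks)          # level 2: combining marks, "" if none
            cases.append(base.islower())   # level 4: assumed upper before lower
        else:
            # level 3: position among the letters counts before the code value
            non_letters.append((len(letters), ord(ch)))
    return (letters, accents, non_letters, cases)


# "t\u00eate" is "tete" with a circumflex on the first e.
words = ["t\u00eate", "tete", "Tete", "te-te", "tetf"]
print(sorted(words, key=four_level_key))
```

Because the key is a tuple, a lower level is consulted only when every
level before it compares equal, which is the rule the post states. With
this key "trans-scription" sorts before "transscrip-tion", as in the
post's example, since the hyphen occurs earlier in the first string.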
irf@kuling.UUCP (10/31/87)
In article <2428@enea.UUCP> sommar@enea.UUCP(Erland Sommarskog) writes:
>
><Note that this is cross-posted. Please submit follow-ups to
> comp.std.internat only>
>
>Some months ago I asked for a new character concept where
>the result of string comparisons should depend on the
>selected languages. I stated that the old ASCII concept
>one character <=> one collation value must go away.
> I have completed a string-comparison package, that performs
>much of what I asked for. (I still think that the OS and
>( the rest has been deleted ...)

If I understand correctly, it seems to me you're trying to reinvent the
wheel. Isn't this what NLS (HP-UX Native Language System, now accepted
as a standard by X/Open) is doing, plus a host of other nifty things
like automatically changing from the Anglo-Saxon 12 hour clock to the
24 hour one, issuing error messages in your own native language instead
of English, taking care of collating sequences (e.g., treating 'll' and
'ch' in a special way in Spanish) and so forth?

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
Bo Thide', Swedish Institute of Space Physics. UUCP: ...enea!kuling!irfu!bt
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
greger@ism780b (Greger Leijonhufvud) (11/05/87)
In article <2428@enea.UUCP> sommar@enea.UUCP(Erland Sommarskog) writes:
>
><Note that this is cross-posted. Please submit follow-ups to
> comp.std.internat only>
>
>Some months ago I asked for a new character concept where
>the result of string comparisons should depend on the
>selected languages. I stated that the old ASCII concept
>one character <=> one collation value must go away.

I hope Erland and others are aware of the work done in the ANSI X3J11
and POSIX organizations. The problem with string comparison was
extensively discussed during the "final" phase of X3J11, especially in
the Internationalization "subcommittee".

The proposal (which, hopefully, will become a full standard) identifies
two specific new library functions intended to provide support for
collation that does not depend on the physical encoding. Both depend
on some (user-selectable) external information on the desired collation
sequence.

The two functions are strcoll(3C) and strxfrm(3C). They differ in that
strcoll compares two items (as strcmp does) according to the desired
collation order, while strxfrm transforms the string according to the
external information so that a subsequent strcmp using the "native"
collation can be performed. Strcoll is useful for occasional compares,
while strxfrm is intended for repeated compares, as in a sort (the
table-driven compares are quite slow compared to the native compare, so
a pre-transformation before the actual sorting is quite advantageous).

Strcoll is also supported in the X/OPEN specifications, as nl_strcmp
and nl_strncmp.

Recently, the /usr/group Internationalization committee has made some
proposals to POSIX P1003.2 (commands & utilities) in the area of
regular expressions which draw heavily on these facilities.

In all cases, the collation order allows the "user" (actually, this is
more of an administrator type of job) to specify a collation order in
which the ordering is independent of character values.
In addition, the user can specify
   1. that a string of characters sorts as one (example: Spanish ch and
      ll),
   2. that one character sorts as a string (example: German double s),
   3. that several characters can have the same collation order
      (example: accented e's sort with unaccented e),
   4. that, if two strings containing such "equivalent" characters
      collate equal, then the order between them depends on a
      "secondary" collation value,
   5. that characters can be designated as "don't care", i.e. are
      disregarded when comparing.

As can be seen, this does change the collation from character-oriented
to string-oriented.

And finally, there are several UNIX systems on the market (notably the
X/OPEN ones, including HP's, and one from IBM) which do provide this
functionality.

If there is interest, I am more than happy to post more elaborate
descriptions of these things to the net.

------
Greger Leijonhufvud
INTERACTIVE Systems Corporation
Santa Monica, CA. 90404

"The above views do not represent anything but mine own..."

Reverse the polarity of the neutron flow!
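
The difference between the two functions discussed in the post above
can be illustrated with a small sketch. It uses Python's locale module,
which wraps the C library's strcoll and strxfrm, rather than the C
interface itself. The helper names sort_with_strcoll and
sort_with_strxfrm are hypothetical, and the collation table is assumed
to come from whatever locale the process has selected.

```python
import functools
import locale

# Assumption: the environment's locale supplies the collation table.
# Fall back to the portable "C" locale if the environment's is unusable.
try:
    locale.setlocale(locale.LC_COLLATE, "")
except locale.Error:
    locale.setlocale(locale.LC_COLLATE, "C")


def sort_with_strcoll(words):
    """Occasional compares: consult the collation table on every comparison."""
    return sorted(words, key=functools.cmp_to_key(locale.strcoll))


def sort_with_strxfrm(words):
    """Repeated compares: transform each string once, then compare the keys."""
    return sorted(words, key=locale.strxfrm)


sample = ["zebra", "apple", "Apple", "mango"]
print(sort_with_strcoll(sample))
print(sort_with_strxfrm(sample))
```

Both helpers should give the same order for a given locale; the second
pays the table-lookup cost once per string instead of once per
comparison, which is the advantage the post describes for sorting.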