DIAMOND.JON%forum.va.gov (04/11/91)
Alternate Collating Sequences    X11/SC1/TG1/91-3

Proposer:  Jon Diamond, Hoskyns Group

2.  JUSTIFICATION

2.1  Needs

The major need is to be able to define collating sequences for processing text in non-English languages. Currently there is only one mechanism for collating sequences. See separate document for further information.

2.2  Existing practice in the area of the proposed change

As far as is known, no current implementation allows user-definable collating sequences, or any extension beyond those defined by the implementation. One implementation (GSM) does allow collation information to be stored in an internal table, which application programmers can then call as an intrinsic function whenever a storage reference is to be made. Some applications store data in different collating sequences, dependent on language, but these also require explicit programming to call the code that converts a string to a collation number before storing it in the database.

3.  DESCRIPTION

3.1  General description of the proposed change

The proposal is that a facility for a collating algorithm be added to those elements of the language which use collating in some way. The reason for using an algorithm rather than character ordering is given in an associated document on Natural Language Handling. The main items that need this are globals and the process/system.

The algorithm takes a single string value and produces a collating value, which can then be used for storage reference or comparison purposes. If a multi-level global is being referenced, the algorithm is applied at each level of the global reference (but see under Unresolved Issues). It is to be understood that two distinct strings may produce the same collating value. The process-specific collation algorithm is understood to apply to the follows operator. All subscripted references to a global (local?) variable will need to be processed via the collating algorithm.

Unlike the pattern match proposal, no general collating algorithm ssvn (e.g. ^$COLLATE) has been created to hold possible algorithms. This is a possibility, but it needs further discussion as to whether it is the best direction to take on the collation issue.

Since creating a global requires knowing the collation algorithm to be used, it is considered imperative that the algorithm be stored either with the global or in an associated area attached to it. In this proposal that area is the ssvn ^$GLOBAL. The references to a specific global within this ssvn are implicitly part of the definition of that global and need to accompany any interchange of the global with another system.

NOTE

Many currently existing systems have two collation algorithms in use (string-collating and numeric-collating) which, although not user-definable, have many of the system-oriented properties of the collation algorithm system proposed here. Use of the string collating sequence implies that the application is non-portable, since this sequence is prohibited by the ANSI Standard.

3.3  Formalization

In section 2.2.7.11 $ORDER, add after part d. of the definition of the ordering sequence:

    "where ] is interpreted using ^$GLOBAL(Name,"COLLATE"), if it exists, rather than ^$JOB($JOB,"COLLATE") (See section 2.3.2.2)."

In section 2.2.7.11 $ORDER, in the paragraph which starts "In words", replace "by the conventional ASCII collating sequence" by "by the current collating sequence for the glvn".

In section 2.3.2.2, in the definition of the relation ] (paragraph 4), replace "in the conventional ASCII collating sequence" with "the current collating sequence".

In section 2.3.2.2, in the definition of the relation ] (paragraph 4), insert a new paragraph after the sentence ending "defined here.":

"This collating sequence is specified below with the aid of a function, CA, which is used for definitional purposes only, to establish the collating sequence. CA(s) is defined for string s as follows:

    If $DATA(^$JOB($JOB,"COLLATE"))=0 or ^$JOB($JOB,"COLLATE")="" then CA(s) returns s

    Otherwise CA(s) returns @(^$JOB($JOB,"COLLATE")_"("_s_")")

In words, it either returns the original string or the result of calling the function reference in ^$JOB($JOB,"COLLATE") using s as the sole parameter, if such a function reference exists. Note that the function reference must start with $$ to call an extrinsic function."

In section 2.3.2.2, in the definition of the relation ] (paragraph 4), replace in definitions b. and c. all occurrences of "A" and "B" with "CA(A)" and "CA(B)" respectively.

In section 2.3.2.2, in the definition of the relation ] (paragraph 4), replace in definition b. "ASCII code" with "$ASCII value".

Add another section (not sure where to):

x.  ssvn semantics

The following ssvns are defined:-

^$GLOBAL(name,"COLLATE") = collation algorithm reference

The SET command can be used to assign a value to this ssvn. The use of the SET command after the first node, except for the unsubscripted name node, [...]. When the first node in a global is created, the following command is implicitly executed by the MUMPS system:-

    SET ^$GLOBAL(name,"COLLATE")=^$JOB($JOB,"COLLATE")

The KILL command is only allowed to KILL the top level node, i.e. ^$GLOBAL(name).

^$JOB($J,"COLLATE") = current collation algorithm reference

(The remaining text from X11/SC7/TG1/91-3 section 3.3 also applies to this entry in ^$JOB.)

3.4  Unresolved Issues

One of the main unresolved issues is the format of the collation algorithm. This cannot be specified as a simple table, for reasons expressed elsewhere, but preferably would be MUMPS code. It might be easier if the entries in the various ssvns etc. are pointers to another table where the collation algorithms are stored.

Another problem is that of reference to levels in globals. In many systems we want to apply a special collation to only the first, or to specified, levels in a global. One problem might be solved if we could specify a defaulting system that took canonic numbers and collated them differently.

A third problem is that any process-specific collation should apply to local variables as well, and we are not proposing that we attach a collation algorithm to locals. However, if we don't, what is going to be the effect on existing subscripted local variables if we change the process-specific collation algorithm?

It was stated above that two distinct strings may produce the same collation value after the collation algorithm has been applied. This could cause problems for applications which assume that such strings would be stored under different subscripts. Further thought is needed about whether this is a real issue or not.

Finally, should this proposal be passed, another modification will be required to the Collates after proposal (]]), which has been under discussion with SC1 for some time.

-- Hokey We are Space Guys. We know what we are doing.