[comp.std.mumps] Collation

DIAMOND.JON%forum.va.gov (04/11/91)

Alternate Collating Sequences        X11/SC1/TG1/91-3
 
Proposer: Jon Diamond, Hoskyns Group
  
2.  JUSTIFICATION
 
2.1  Needs
 
The major need is to be able to define collating sequences for 
processing text in non-English languages. Currently there is only one 
mechanism for collating sequences. See separate document for further 
information.
 
2.2  Existing practice in the area of the proposed change
 
As far as is known, no current implementation allows collating 
sequences to be defined by the user, or extended beyond those 
defined by the implementation.
 
One implementation (GSM) does allow for the storage of collation 
information in an internal table, which is then accessible to be 
called as an intrinsic function by application programmers whenever a 
storage reference is to be made.
 
Some applications store data in different collating sequences, 
dependent on language, but these also require explicit programming to 
call the code to convert a string to a collation number before 
storing in the database.
 
 
3.  DESCRIPTION
 
3.1  General description of the proposed change
 
The proposal is that a facility for a collating algorithm be added to 
those elements of the language which use collating in some way. The 
reason for the use of an algorithm rather than character ordering is 
given in an associated document on Natural Language Handling. The 
main items that need this are globals and the process/system. The 
algorithm takes a single string value and produces a collating value 
which can then be used for storage reference or comparison purposes. 
If a multi-level global is being referenced then the algorithm will 
be applied to each level of the global reference (but see under 
unresolved issues). It is to be understood that two distinct strings 
may produce the same collating value.
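
As an illustration only, the per-level application described above 
can be modelled in Python. The algorithm fold_case and the helper 
collating_key are hypothetical names invented for this sketch; the 
proposal does not define any particular algorithm.

```python
def fold_case(subscript):
    # Hypothetical collation algorithm: ignore letter case.
    return subscript.upper()

def collating_key(subscripts, algorithm):
    # Apply the algorithm independently to each level of the reference.
    return tuple(algorithm(s) for s in subscripts)

# Three two-level references, written as (level 1, level 2).
refs = [("names", "cherry"), ("names", "Banana"), ("names", "apple")]

# Ordering by raw character codes puts "Banana" first (upper case sorts low).
by_ascii = sorted(refs)

# Ordering by collating value gives the order a reader would expect.
by_collation = sorted(refs, key=lambda r: collating_key(r, fold_case))
```

The stored strings are unchanged; only the value used for ordering 
and comparison differs.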
 
The process-specific collation-algorithm is understood to apply to 
the follows operator. All subscripted references to a global (local?) 
variable will need to be processed via the collating algorithm.
 
Unlike the pattern match proposal, no general collating algorithm ssvn 
(e.g. ^$COLLATE) has been created to hold possible algorithms. This is 
a possibility, but whether it is the best direction to take on 
collation needs further discussion.
 
Since the creation of a global necessitates an understanding of the 
collation algorithm to be used, it is considered imperative that the 
algorithm be stored either with the global or in an associated area 
attached to it. In this proposal this is in the ssvn ^$GLOBAL. It should be 
noted that the references to a specific global within this ssvn are 
implicitly part of the definition of a global and need to be 
associated with any interchange of this global with another system.
 
NOTE: Many currently existing systems have two collation algorithms 
(string-collating and numeric-collating) in use which, although not 
user-definable, have many of the system-oriented properties of the 
collation algorithm system proposed here. Note further that use of 
the string collating sequence makes an application non-portable, 
since this sequence is prohibited by the ANSI Standard.
 
3.3  Formalization
 
In section 2.2.7.11 $ORDER add after part d. of the definition of the 
ordering sequence
 
"where ] is interpreted using ^$GLOBAL(Name,"COLLATE"), if it exists,
rather than ^$JOB($JOB,"COLLATE") (See section 2.3.2.2)."
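
The precedence stated in this insertion can be sketched as follows. 
This is a simplified Python model, not proposed standard text: the 
two ssvns are represented by dictionaries, the function names are 
invented for the sketch, and the special treatment of canonic numbers 
in $ORDER is deliberately left out.

```python
def effective_collation(global_ssvn, job_ssvn, name, job):
    # ^$GLOBAL(name,"COLLATE") wins if it exists; otherwise fall back
    # to ^$JOB(job,"COLLATE"), with "" meaning "no algorithm".
    key = (name, "COLLATE")
    if key in global_ssvn:
        return global_ssvn[key]
    return job_ssvn.get((job, "COLLATE"), "")

def order_next(subscripts, current, collate):
    # Simplified $ORDER: the subscript whose collating value is the
    # smallest one greater than that of `current`, or "" if none.
    later = [s for s in subscripts
             if current == "" or collate(s) > collate(current)]
    return min(later, key=collate) if later else ""

# Hypothetical ssvn contents for the example.
global_ssvn = {("NAMES", "COLLATE"): "$$FOLD^COLL"}
job_ssvn = {(7, "COLLATE"): "$$OTHER^COLL"}
chosen = effective_collation(global_ssvn, job_ssvn, "NAMES", 7)
fallback = effective_collation(global_ssvn, job_ssvn, "OTHER", 7)
```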
 
In section 2.2.7.11 $ORDER in the paragraph which starts "In words" 
replace
 
"by the conventional ASCII collating sequence"
 
by
 
"by the current collating sequence for the glvn"
 
In section 2.3.2.2 in the definition of the relation ] (paragraph 4) 
replace
 
"in the conventional ASCII collating sequence"
 
with
 
"in the current collating sequence"
 
In section 2.3.2.2 in the definition of the relation ] (paragraph 4) 
insert a new paragraph after the sentence ending "defined here."
 
"This collating sequence is specified below with the aid of a 
function, CA, which is used for definitional purposes only, to 
establish the collating sequence.
 
CA(s) is defined for string s as follows:
 
    If $DATA(^$JOB($JOB,"COLLATE"))=0 or 
    ^$JOB($JOB,"COLLATE")="" then CA(s) returns s
    Otherwise CA(s) returns @(^$JOB($JOB,"COLLATE")_"("_s_")")
 
In words, it either returns the original string or the result of 
calling the function reference in ^$JOB($JOB,"COLLATE") using s as 
the sole parameter, if such a function reference exists. Note that 
the function reference must start with $$ to call an extrinsic 
function."
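
The definition of CA can be modelled in a few lines of Python. In 
this sketch MUMPS indirection is replaced by a lookup in a table of 
callables; the table EXTRINSICS and the reference "$$UPPER^COLLATE" 
are hypothetical and exist only to make the example self-contained.

```python
# Hypothetical stand-in for extrinsic functions reachable by indirection.
EXTRINSICS = {"$$UPPER^COLLATE": str.upper}

def CA(s, job_ssvn, job):
    # job_ssvn models ^$JOB; a missing entry plays the role of $DATA=0.
    ref = job_ssvn.get((job, "COLLATE"), "")
    if ref == "":
        # No algorithm defined: the string collates as itself.
        return s
    # Otherwise call the referenced function with s as sole parameter.
    return EXTRINSICS[ref](s)

undefined = CA("abc", {}, 1)
empty = CA("abc", {(1, "COLLATE"): ""}, 1)
folded = CA("abc", {(1, "COLLATE"): "$$UPPER^COLLATE"}, 1)
```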
 
In section 2.3.2.2 in the definition of the relation ] (paragraph 4) 
replace in the definitions b. and c. all occurrences of
 
"A"  and "B"
 
with
 
"CA(A)" and "CA(B)"
 
respectively.
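
The effect of this substitution on the follows operator can be 
sketched as below. The sketch is a simplification: it applies CA to 
both operands throughout, whereas the change above touches only 
definitions b. and c., and it relies on Python comparing strings 
character by character on code values, which coincides with $ASCII 
for 7-bit characters. The function name follows is invented here.

```python
def follows(a, b, ca=lambda s: s):
    # Model of A ] B evaluated on CA(A) and CA(B); the default ca is
    # the identity, i.e. no collation algorithm is defined.
    return ca(a) > ca(b)

# Without an algorithm, "a" (97) follows "B" (66).
plain = follows("a", "B")

# With a hypothetical case-folding algorithm it no longer does.
folded = follows("a", "B", str.upper)

# Strings with equal collating values follow each other in neither order.
tie = follows("abc", "ABC", str.upper) or follows("ABC", "abc", str.upper)
```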
 
In section 2.3.2.2 in the definition of the relation ] (paragraph 4) 
replace in the definitions of b.
 
"ASCII code"
 
with
 
"$ASCII value"
 
Add another section (not sure where to)
 
x. ssvn semantics
 
The following ssvns are defined:-
 
^$GLOBAL(name,"COLLATE") = collation algorithm reference
 
The SET command can be used to assign a value to this ssvn. When the 
first node in a global, other than the unsubscripted name node, is 
created, the following command is implicitly executed by the MUMPS 
system:-
 
    SET ^$GLOBAL(name,"COLLATE")=^$JOB($JOB,"COLLATE")
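
A toy model of this implicit assignment follows, reading the text 
literally: the process's algorithm is copied to the global when its 
first subscripted node is created, and not afterwards. The class 
Database and its method are hypothetical, and how an earlier explicit 
SET of the ssvn should interact with the implicit one is not modelled 
because the proposal does not settle it.

```python
class Database:
    def __init__(self):
        self.global_ssvn = {}  # models ^$GLOBAL(name,"COLLATE")
        self.nodes = {}        # models nodes: (name, subscripts) -> value

    def set_node(self, name, subscripts, value, job_collate):
        # Does this global already have any subscripted node?
        has_subscripted = any(
            n == name and subs for (n, subs) in self.nodes)
        if subscripts and not has_subscripted:
            # First subscripted node: capture the job's algorithm.
            self.global_ssvn[(name, "COLLATE")] = job_collate
        self.nodes[(name, tuple(subscripts))] = value

db = Database()
db.set_node("X", [], "top", "$$A^COLL")   # unsubscripted name node
after_top = dict(db.global_ssvn)
db.set_node("X", ["a"], 1, "$$A^COLL")    # first subscripted node
db.set_node("X", ["b"], 2, "$$B^COLL")    # job algorithm has changed
```

In this model the global keeps the algorithm captured at creation 
even though the process later switches to a different one.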
 
The KILL command is only allowed to KILL the top-level node, i.e.
^$GLOBAL(name).
 
^$JOB($J,"COLLATE") = current collation algorithm reference
 
(The remaining text from X11/SC7/TG1/91-3, section 3.3, also applies 
to this entry in ^$JOB.)
 
3.4  Unresolved Issues
 
One of the main unresolved issues is the format of the collation 
algorithm. This cannot be specified as a simple table, for reasons 
expressed elsewhere, but preferably would be MUMPS code. It might be 
easier if the entries in the various ssvns etc are pointers to 
another table where the collation algorithms are stored.
 
Another problem is that of reference to levels in globals. In many 
systems we want to apply a special collation to only the first, or 
specified levels, in a global. One problem might be solved if we 
could specify a defaulting system that took canonic numbers and 
collated them differently.
 
A third problem is that any process-specific collation should apply 
to local variables as well, and we are not proposing to attach a 
collation algorithm to locals. However, if we don't, what will be 
the effect on existing subscripted local variables if we change 
the process-specific collation algorithm?
 
It was stated above that two distinct strings may produce the same 
collating value after the collation algorithm has been applied. This 
could cause problems for applications which assume that the strings 
would be stored under different subscripts. Further thought is needed 
about whether this is a real issue.
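
The concern can be made concrete with a small Python model. It 
assumes, purely for illustration, an implementation that identifies a 
node by its collating value alone; the proposal does not say that 
implementations must behave this way, and the helper store is a 
hypothetical name.

```python
def store(nodes, subscript, value, collate):
    # Assumed behaviour: the node is located by its collating value.
    nodes[collate(subscript)] = (subscript, value)

nodes = {}
store(nodes, "Muller", 1, str.upper)
store(nodes, "MULLER", 2, str.upper)  # application expects a second node

node_count = len(nodes)
survivor = nodes["MULLER"]
```

Under that assumption the second SET silently replaces the first, 
which is the behaviour an unsuspecting application would not expect.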
 
Finally, another modification will be required to the Collates after 
proposal (]]) which has been under discussion with SC1 for some time, 
should this proposal be passed.

-- 
Hokey				We are Space Guys.  We know what we are doing.