[comp.std.mumps] Natural Language handling

DIAMOND.JON%forum.va.gov (04/11/91)

This is the first in a set of documents about Natural Language
Handling that will be submitted to SC1/TG1 for the June
meeting and will go out in the pre-meeting mailing.
 
Assuming that the TG is reasonably happy with them then I
hope that we can get them to SC1 type B in June (as they
will be coming out of the TG).
 
I would like your comments ASAP so that I can revise them as
necessary before submitting them to Don Piccone for
the pre-meeting mailing. This really means all comments
have to be back to me by 15th April at the latest.
 
It is clear that, at the very least, the document that follows
here is incomplete in a number of areas and I will continue
working on it so that I can provide a revised version AT the
meeting for everyone there. So if you can't make the deadline above please
send your comments ASAP anyway.
 
Be warned: the document that follows here is quite long...
 
Thanks in advance for your assistance.
----------------------------------------------
Natural Language Handling
 
 Jon Diamond, Hoskyns Group,
 
 1.  INTRODUCTION
 
 This document is the result of discussions over a long period of 
time within the MUMPS Development Committee by a significant 
number of people. It is produced by task group 1 of Subcommittee 1 
(Language) of the MDC. It attempts to provide:
 
 -  a complete background to the issues of handling multiple 
languages for those who may be unfamiliar with the problems
 
 -  a practical solution to these problems, comprising
     -  proposed modifications to the MUMPS language
     -  suggestions to implementors as to how these modifications 
        might be implemented
     -  directions to programmers as to how to use these 
        modifications in order to achieve portable multi-lingual 
        application programs
 
 -  a description of other related issues which are not part of the 
mandate of this task group, eg external character representations.
 
 By its very nature this document cannot cover all possibilities 
since the various authors do not have experience of all languages 
now in existence. It is believed that the direction is correct and 
open-ended, whilst minimising the implementation impacts. 
Feedback is, however, urgently required in order to ensure that this 
is so.
 
 
 2.  JUSTIFICATION OF PROPOSED CHANGE
 2.1  Needs
 
 MUMPS is an international language, with versions already available 
for Japanese, Chinese, Russian and some other European languages. 
These versions are not yet explicitly included in the MUMPS standard 
and some conventions adopted in each language-specific 
implementation are not extensible to the multi-lingual environment. 
International acceptance of software in the English language is now 
becoming more problematic. Each language group is now insisting 
that software is provided for them in their own language and their 
own character set. If MUMPS is to continue to spread internationally 
then it must embrace these requirements and provide the basis for 
these applications to be supplied by MUMPS application providers. 
Finally, systems are now being implemented whose users will be 
accessing them in their own language and storing data, which may 
subsequently be displayed and manipulated by users in a different 
language. This is an emerging requirement which will, with the 
increasing integration of systems across Europe and the world, 
become more and more important in the future.
 
 MUMPS has the opportunity to provide leadership in the programming 
language community if it can define features that will enable it to 
process multiple character sets. The timing is also appropriate, 
since there are a number of language features under discussion 
which facilitate multi-lingual capabilities without adding a complex 
array of new language features. Furthermore since MUMPS has not 
defined a character in terms of the number of bits in which it is stored 
it is anticipated that the mechanisms provided within MUMPS for 
addressing these problems will be significantly simpler than in 
other languages.
 
 2.2  Existing Practice
 
 There are several versions of MUMPS which can accommodate 
character sets other than ASCII. All of these versions, however, are 
bilingual in nature, such as English-Japanese, English-Chinese 
and English-Russian. Extensions in these implementations 
have been adopted to permit string operators ($FIND, $LENGTH, 
$EXTRACT, $PIECE etc) to deal with logical character lengths. In 
some of these systems these extensions accommodate character 
sets in which a single character is represented by more than one 
byte, with in general escape sequences prefixing character set 
changes. In other systems 8-bit bytes are used, but the upper 128 
characters are normally significantly different to US systems and 
even other non-US systems with the same MUMPS implementation.
 
 Extensions to permit pattern match operators specific to some 
other languages have also been adopted for some languages. String 
subscript collation has also been extended to accommodate 8-bit 
and two-byte character codes.
 
 These extensions are neither a part of the MUMPS standard, nor are 
they truly multi-lingual. There is no uniformity in these extensions, 
and they do not include all the features necessary for multi-lingual 
processing. Finally, they typically provide an explicit mechanism for 
addressing many of the issues involved in multi-lingual processing. 
A mainly implicit mechanism would greatly improve the availability 
of multi-lingual applications and also minimise the errors 
introduced by the additional programming requirements.
 
 Existing practice on non-MUMPS systems is diverse. On the IBM PC 
each language has its own code page, together with a keyboard 
mapping. This gives each language a unique set of 8-bit codes and 
makes data interchange problematic. DEC have defined a multi-
national character set which operates across all their hardware 
platforms. This makes interchange easy, but creates a unique 
character set which does not encompass all requirements, eg 
Japanese. The issues of collation do not appear to have been 
satisfactorily addressed in these two environments. Unix systems 
are now beginning to address the problems, with a lot of work being 
undertaken in POSIX and X/Open on these issues. Some of this is 
referenced later in this document.
 
 2.3  Goals
 
 The main objective of the current work is to produce a framework in 
which application providers can write applications which are 
portable between systems using different languages and which can also 
support multiple languages simultaneously. This encompasses not 
only the display of data entry screens in the terminal user's own 
language, but also the storage and manipulation of data which may 
be entered in one of several languages.
 
 It is also hoped that this will lead to the ability to 
interchange multi-lingual data between systems. It is anticipated 
that this will initially be between homogeneous systems, but 
ultimately the goal is to be able to communicate between 
heterogeneous MUMPS systems and even to non-MUMPS systems.
 
 The objective is limited in this set of proposals to the ability to 
process multi-lingual information. It does not address all the 
issues related to internationalisation as it is understood in other 
systems. For example POSIX makes a definition of locale which 
includes a whole host of other issues, eg currency symbol, date 
format etc. A list of these issues is presented in Appendix A. These 
will be addressed in other papers and in other ways within the 
MUMPS language.
 
 One of the design goals is also to enable, as far as possible, 
existing applications to operate in a language environment different 
from the one in which they were created, or even in a multi-lingual one. If 
this is achievable then a major step forward will be taken in 
reducing the cost of software written for these languages and also 
increasing the potential user base of existing applications at 
minimal cost.
 
 The other major design goal is to minimise the impact on the 
structure and syntax of the language in providing these capabilities. 
This is important in ensuring that no major additional learning time 
is needed in order to be able to use the new facilities.
 
 Finally, it is a design aim to minimise the total implementation 
effort and performance impact of these changes. The proposed 
changes will cause modifications to many if not all existing MUMPS 
implementations. Although this particular design aim is not an 
overriding criterion it will obviously be beneficial to all parties if 
the effort involved and the impact on existing systems can be 
minimised.
 
 3.  PROBLEM DEFINITION
 
 Character Sets
 
 MUMPS as it currently exists only recognises a single 
character entity. There is no statement within the standard as to the 
number of bits which represent a character. The definition was 
carefully phrased so as to be vague as to this issue, since MUMPS 
handles these kinds of issues at a conceptual (logical) level and not 
a physical one. The only requirement from all Standard MUMPS 
systems is that the $CHARACTER function when applied to a number 
between 0 and 127 produces the equivalent character from the ASCII 
character set (X3.4-1984), with $ASCII being the reverse function 
producing a numeric value corresponding to a character. The 
implication of this requirement is that portable programs can only 
rely on 7-bit ASCII characters being available in any system. As was 
mentioned earlier there are a number of different systems using 8-
bit characters which use characters 128-255 differently.
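 
 The portability rule above can be illustrated with a short sketch. 
It is written in Python rather than MUMPS, purely for illustration, 
and the function names are hypothetical. Python's chr and ord stand in 
for $CHARACTER and $ASCII; the sketch only checks whether a value lies 
in the 0-127 range that every Standard MUMPS system must support and 
does not reproduce the full semantics of either MUMPS function.

```python
def portable_char(code: int) -> str:
    """Stand-in for $CHARACTER, limited to the guaranteed ASCII range."""
    if not 0 <= code <= 127:
        raise ValueError(f"code {code} is outside the portable 7-bit range")
    return chr(code)


def portable_code(ch: str) -> int:
    """Stand-in for $ASCII on a single character in the guaranteed range."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is not a portable 7-bit ASCII character")
    return code


def is_portable(text: str) -> bool:
    """True if every character lies in the range all systems agree on."""
    return all(ord(ch) <= 127 for ch in text)


if __name__ == "__main__":
    print(portable_char(65), portable_code("A"))
    print(is_portable("PATIENT NAME"), is_portable("caf\u00e9"))
```

 A program that must run unchanged on every implementation can only 
rely on strings for which is_portable is true; anything above 127 
depends on the particular system.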
 
 It is important to distinguish a number of different usages of 
character sets for the purposes of further discussion:
 
 1.  MUMPS Representation
 
     This is what was discussed above. It corresponds to the 
character that is presented to the application program by the 
underlying MUMPS implementation. There is no restriction as to its 
length in bits, or currently its interpretation above the 7-bit limit.
 
 2.  Internal representation
 
     This is the physical storage mechanism within the 
implementor's system for a character. It is conceivable that this 
might be different for locals and globals, depending on the different 
costs involved for CPU, memory, disk storage and I/O time. 
 
     One issue that is sometimes raised, especially in other 
programming languages, is that if the character set to be used in 1) 
above is 16 or 32 bits then the storage needed will automatically be 
doubled or quadrupled. This need not necessarily be true as it is an 
implementor's decision as to the internal representation. There have 
been a number of papers presented discussing encoding schemes for 
multi-octet characters in order to reduce the storage requirements. 
One possible scheme is to use the top bit of a byte in order to 
indicate that the next byte is also part of the character. This leads 
to a storage of one byte for characters 0-127, two bytes for 128-
16383 etc.
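 
     The top-bit scheme just described can be sketched as follows. 
This is an illustration in Python, not a proposed internal format: the 
function names are hypothetical, and the choice to store the most 
significant 7-bit group first is an assumption, since the text does 
not fix a byte order.

```python
from typing import List


def encode_char(code: int) -> bytes:
    """Encode one character code using 7 data bits per byte.

    The top bit of a byte is set when another byte of the same
    character follows; it is clear on the final byte.
    """
    if code < 0:
        raise ValueError("character codes must be non-negative")
    groups = [code & 0x7F]
    code >>= 7
    while code:
        groups.append(code & 0x7F)
        code >>= 7
    groups.reverse()  # assumption: most significant group first
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def decode(data: bytes) -> List[int]:
    """Decode a byte string produced by encode_char back into codes."""
    codes = []
    value = 0
    pending = False
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        pending = bool(byte & 0x80)
        if not pending:
            codes.append(value)
            value = 0
    if pending:
        raise ValueError("data ends in the middle of a character")
    return codes


if __name__ == "__main__":
    for code in (65, 127, 128, 16383, 16384):
        print(code, len(encode_char(code)))
```

     Characters 0-127 occupy one byte and 128-16383 occupy two, as 
stated above, so text that is mostly ASCII costs little more than it 
does today.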
 
 3.  Display representation
 
     This is the physical character appearing on a display (output) 
device. Strictly speaking in MUMPS terms this corresponds to the 
code number which is sent to the device to cause a character 
display.
 
 4.  Interchange representation
 
     This is a character or code number which is used for 
interchange between systems. Both the sending and receiving 
systems need to agree on the specific meaning of a code. In general 
terms there is no difference between this representation and the 
display representation; however, display representations will be 
driven by specific devices and interchange representations by 
standards definitions.
 
 From now on the discussion will focus on the MUMPS, or logical, 
representation. When any other representation is being discussed 
this will be specifically identified.
 
 The handling of external representation of characters is not part of 
the scope of work of this task group, however a number of issues are 
obvious and some possible solutions which address the needs of 
natural languages are proposed. Clearly the issues of font, character 
size and other attributes need to be solved for the English language. 
If these general issues are solved then the handling of characters in 
natural languages will automatically be solved.
 
 Collation
 
 For natural languages the issue of the collating sequence of data is 
extremely important and one of the most difficult issues.
 
 It is obvious that with a different character set the standard 
collation sequence that we have traditionally assumed as being 
implied by the ASCII collating sequence will have to change. What is 
not clear at first glance is how significant the changes actually 
need to be.
 
 Those languages that are most different to English in visual 
appearance, such as Japanese and Chinese, have fundamentally 
different collating sequences. For example in Japanese the main 
ordering in most dictionaries has been by number of strokes used in 
drawing the character. This ordering is not reflected in the order of 
characters described in any Japanese computer coded character set.
 
 In some ways this is simpler than the collation required for many 
Indo-European languages. Some years ago it was thought by many of 
the non-American participants within the MUMPS Development 
Committee that the usage of diacritical characters (accented 
ones) could be handled by a relatively simple transform with 
two usages of the $TRANSLATE function (see X11/SC1/???). 
However recent papers by LaBonté and van Wingen within the ISO 
programming language group working on international character sets 
have shown that this is not true. It appears that we need at least a 
four-pass transformation in order to be able to handle French. (It 
should also be noted at this point that the papers presented within 
this forum have shown that the collation required for French as used 
in Canada is different from that in France, similarly for German in 
Germany and Switzerland.)
 
 With this new information it is clear that our earlier simplistic 
notions are no longer able to produce the required results. Even 
looking at dictionaries and telephone directories in English shows a 
number of anomalies which cannot be easily rectified. Furthermore 
since we cannot be experts in every single language in the world if 
we are to solve the problem properly then we must generalise our 
notion of collation and call on a more general algorithmic approach.
 
 It is therefore recommended that collation be handled by specifying 
an appropriate algorithm for each language. This will calculate the 
unique collation value for a string. (The ISO work is aimed at being 
able to compare two strings and therefore can be specified 
differently. We, however, have a different requirement since we are 
assuming that the values will be used as subscripts for storage in 
globals.) It should not be assumed that this algorithm is reversible, 
ie given a collation value that the original input can be 
reconstructed. An example from English should suffice to 
demonstrate this - if we require collation that was totally case 
insensitive we would transform A and a both to a. We would then be 
unable to decide from the collation value a whether the character in 
the original string was A or a.
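 
 The case-insensitive example can be made concrete with a small 
sketch, written in Python for illustration only. The collation_key 
function is a hypothetical stand-in for a language-specific algorithm, 
and a dictionary keyed by collation value stands in for the subscripts 
of a global.

```python
def collation_key(text: str) -> str:
    """Hypothetical collation algorithm: totally case insensitive."""
    return text.lower()


# The collation value is used as the subscript. Because the original
# string cannot be rebuilt from it, the original is kept as data.
index = {}
for name in ["Smith", "SMITH", "smith", "Jones"]:
    index.setdefault(collation_key(name), []).append(name)

if __name__ == "__main__":
    for key in sorted(index):
        print(key, index[key])
```

 Three different inputs share the collation value smith, so the value 
alone cannot say which of them was originally entered.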
 
 The issue of collation affects MUMPS in two particular ways - 
subscript ordering (both for globals and locals), with the appropriate 
impact on $ORDER and $QUERY and the follows operator ]. Most of the 
discussion has been focussed on the storage issue, however a full 
solution also needs to address how these operators are also to be 
affected.
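 
 How the choice of collation changes the follows operator and $ORDER 
can be sketched as below. This is Python, not MUMPS; follows and order 
are hypothetical helpers that only mimic the ordering aspect of ] and 
$ORDER for string subscripts (the special treatment of numeric 
subscripts is ignored), and str.lower is used as an assumed collation 
algorithm.

```python
def follows(left, right, key=lambda s: s):
    """Mimic 'left ] right': true if left sorts after right."""
    return key(left) > key(right)


def order(subscripts, current, key=lambda s: s):
    """Mimic $ORDER: next subscript after current, or "" at the end.

    Passing "" as current returns the first subscript.
    """
    later = [s for s in subscripts if key(s) > key(current)]
    return min(later, key=key) if later else ""


if __name__ == "__main__":
    subs = ["apple", "Zebra", "banana"]
    print(order(subs, ""))                 # code-value ordering
    print(order(subs, "", key=str.lower))  # case-insensitive ordering
    print(follows("apple", "Zebra"))
    print(follows("apple", "Zebra", key=str.lower))
```

 Under plain code-value ordering Zebra comes first because upper-case 
letters precede lower-case ones; under the case-insensitive collation 
it comes last, and the result of the follows test is reversed. Both 
the traversal and the operator therefore have to use the same 
collation if programs are to behave consistently.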
 
 There are a number of reasons for using an implicit collation 
scheme rather than an explicit scheme. The explicit one would 
require each application program to use some kind of function 
whenever a data reference is made. This would mean changes to 
every existing program and it would also be error prone. 
Furthermore, because of the issues that have been discussed above it 
would require at least a four level scheme, thereby increasing the 
maximum size requirement for a subscript by a factor of four, ie 
from the current 128 to 512, and also requiring a doubling of the 
current string limit. Since the number of levels required for every 
language is not known at this time it is still not clear that this 
would address every possible language.
 
 Pattern Matching
 
 It is clear from the considerations of most languages that the basic 
pattern match codes defined in the MUMPS standard are insufficient 
and inadequate. These codes may be grouped into three classes - 
numeric (N), alphabetic (AULP) and other (CE). For the purposes of 
multi-lingual handling none of the definitions for any of these 
categories is acceptable, even though at first sight only the 
alphabetic ones might be affected.
 
 Numeric
 
 This definition only refers to the 10 ASCII numeric characters. In 
some character sets numbers may appear in duplicated positions, 
although this is obviously not a good idea. The usage of ISO 10646 
would obviate this problem.
 
 Alphabetic
 
 These reference upper-case and lower-case as being 26 characters 
each. With languages that have a similar usage to English this is 
inadequate since the number of characters in a case varies from 
language to language. Furthermore the number of characters in 
upper-case could be different to that in lower-case. For example in 
Modern French accents are not normally used on upper-case letters. 
 
 Note: It may still be a good idea to have an upper-case equivalent to 
a lower-case character so that the conversion from lower to upper 
to lower again does not lose information. However this would cause 
other problems with comparing strings for equality.
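 
 The trade-off in the note can be seen in a short sketch (Python, for 
illustration only). The two conversion functions are hypothetical: one 
follows the convention of dropping accents on upper-case letters, the 
other keeps an accented upper-case equivalent for every lower-case 
letter.

```python
import unicodedata


def upper_without_accents(text: str) -> str:
    """Upper-case conversion that drops accents (assumed convention)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return stripped.upper()


def upper_with_accents(text: str) -> str:
    """Upper-case conversion that keeps an accented equivalent."""
    return text.upper()


if __name__ == "__main__":
    word = "c\u00f4t\u00e9"  # "côté"
    print(upper_without_accents(word).lower() == word)
    print(upper_with_accents(word).lower() == word)
    print(upper_without_accents(word) == upper_with_accents(word))
```

 Dropping accents loses information on the round trip, whereas keeping 
them preserves it but yields two different upper-case spellings of the 
same word, which then fail a simple equality test.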
 
 In Japanese another two alphabets are used (Hiragana and Katakana) 
in addition to the English one and the ideograms (Kanji). Similarly 
some East European languages use Cyrillic in addition to English. It 
therefore makes sense that additional pattern codes be available to 
match these particular requirements.
 
 The pattern match P is described as punctuation. These characters 
also vary from language to language, for example in Spanish ¿ is 
used in addition to ? to bracket a question.
 
 Other
 
 The issue of control characters and every character, pattern codes C 
and E, are less important but probably need redefinition in relation 
to any chosen character set.
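 
 The limits of the existing pattern codes can be illustrated with a 
sketch that classifies single characters in two ways. It is written in 
Python for illustration only. The first function follows the ASCII-only 
definitions discussed above; the second is a hypothetical redefinition 
that uses the general character categories supplied by Python's 
unicodedata module as a stand-in for per-character-set tables. The 
mapping from categories to pattern codes is an assumption of this 
sketch, not part of any proposal.

```python
import unicodedata


def ascii_pattern_codes(ch: str) -> set:
    """Pattern codes matched by ch under the ASCII-only definitions."""
    codes = {"E"}
    o = ord(ch)
    if o < 32 or o == 127:
        codes.add("C")
    if "0" <= ch <= "9":
        codes.add("N")
    if "A" <= ch <= "Z":
        codes |= {"A", "U"}
    if "a" <= ch <= "z":
        codes |= {"A", "L"}
    if 32 <= o <= 47 or 58 <= o <= 64 or 91 <= o <= 96 or 123 <= o <= 126:
        codes.add("P")
    return codes


def unicode_pattern_codes(ch: str) -> set:
    """Hypothetical redefinition based on general character categories."""
    cat = unicodedata.category(ch)
    codes = {"E"}
    if cat == "Cc":
        codes.add("C")
    if cat == "Nd":
        codes.add("N")
    if cat.startswith("L"):
        codes.add("A")
    if cat == "Lu":
        codes.add("U")
    if cat == "Ll":
        codes.add("L")
    if cat[0] in "PS" or cat == "Zs":
        codes.add("P")
    return codes


if __name__ == "__main__":
    for ch in ["A", "\u00e9", "\u00bf", "\u30ab", "\uff13"]:
        print(repr(ch), sorted(ascii_pattern_codes(ch)),
              sorted(unicode_pattern_codes(ch)))
```

 Under the ASCII definitions an accented letter, the Spanish inverted 
question mark, a Katakana character and a duplicated digit all match 
only E. Under the redefinition each falls into a sensible class, 
although the Katakana character is alphabetic without being upper or 
lower case, which is one reason additional pattern codes are needed.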
 
 Display Issues
 
 The issues of entering and displaying multi-lingual data are, 
strictly speaking, not in the scope of this particular task group of 
SC1 of the MDC. However a number of issues are raised which point 
to potential solutions which fit into the scheme of modifications to 
MUMPS proposed here.
 
 The main reason for handling data-entry and display separately is 
that modern technology has given much greater control over 
character and data attributes, eg fonts, character sizes, highlighting 
and thus it seems that a more general solution is required to ensure 
that any extensions proposed for multi-lingual handling do not 
prevent further extensions for these other features.
 
 Handling natural language data may mean the pressing of several 
characters on the keyboard in order to enter a single character of 
data, eg accented characters being entered on a keyboard which does 
not provide this as a single key stroke. It also may involve 
outputting several characters, possibly including backspaces, partial 
line up/down, in order to produce an acceptable printed image. 
Furthermore the order of characters may change their display 
representation, as in Arabic, and the size may vary, as in the 
Japanese usage of Kanji/Roman characters.
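 
 The many-to-one and one-to-many relationships described above can be 
sketched as follows (Python, for illustration only). Both functions 
are hypothetical. The first combines a letter and a following accent 
keystroke into one logical character using the normalisation routine 
in Python's unicodedata module; the second expands an accented 
character into a letter, a backspace and an approximate accent mark 
for a printer that can only overstrike. The table of overstrike 
characters is an assumption and covers only a few accents.

```python
import unicodedata

# Assumed approximations of accents available on a simple printer.
OVERSTRIKE = {"\u0301": "'", "\u0300": "`", "\u0302": "^", "\u0308": '"'}


def compose_keystrokes(keys) -> str:
    """Combine a sequence of keystrokes into logical characters."""
    return unicodedata.normalize("NFC", "".join(keys))


def to_overstrike(text: str) -> str:
    """Expand accented characters into letter, backspace, accent mark."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if ch in OVERSTRIKE:
            out.append("\b" + OVERSTRIKE[ch])
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    typed = ["e", "\u0301"]  # letter e, then an acute accent key
    logical = compose_keystrokes(typed)
    print(len(typed), len(logical))
    print(repr(to_overstrike(logical)))
```

 Two keystrokes become one character of data, and that one character 
becomes three codes on output, so the number of characters seen by the 
application need not match what is typed or what is sent to the 
device.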

-- 
Hokey				We are Space Guys.  We know what we are doing.