DIAMOND.JON%forum.va.gov (04/11/91)
This is the first in a set of documents about Natural Language Handling that will be submitted to SC1/TG1 for the June meeting and will go out in the pre-meeting mailing. Assuming that the TG is reasonably happy with them, I hope that we can get them to SC1 type B in June (as they will be coming out of the TG).

I would like your comments ASAP so that I can revise the documents as necessary before submitting them to Don Piccone for the pre-meeting mailing. This really means all comments have to be back to me by 15th April at the latest.

It is clear that, at the very least, the document that follows is incomplete in a number of areas, and I will continue working on it so that I can provide a revised version AT the meeting for everyone there. So if you can't make the deadline above, please send your comments ASAP anyway.

Be warned: the document that follows is quite long...

Thanks in advance for your assistance.

----------------------------------------------
Natural Language Handling
Jon Diamond, Hoskyns Group

1. INTRODUCTION

This document is the result of discussions held over a long period within the MUMPS Development Committee by a significant number of people. It is produced by Task Group 1 of Subcommittee 1 (Language) of the MDC.

It attempts to provide

-   a complete background to the issues of handling multiple languages, for those who may be unfamiliar with the problems
-   a practical solution to these problems, by
        proposing modifications to the MUMPS language,
        suggesting to implementors how these modifications might be implemented, and
        directing programmers in how to use these modifications to achieve portable multi-lingual application programs
-   a description of other related issues which are not part of the mandate of this task group, eg external character representations.
By its very nature this document cannot cover all possibilities, since the various authors do not have experience of all languages now in existence. It is believed that the direction is correct and open-ended, whilst minimising the implementation impact. Feedback is, however, urgently required in order to ensure that this is so.

2. JUSTIFICATION OF PROPOSED CHANGE

2.1 Needs

MUMPS is an international language, with versions already available for Japanese, Chinese, Russian and some other European languages. These versions are not yet explicitly included in the MUMPS standard, and some conventions adopted in each language-specific implementation are not extensible to the multi-lingual environment.

International acceptance of software in the English language is now becoming more problematic. Each language group is now insisting that software is provided for them in their own language and their own character set. If MUMPS is to continue to spread internationally then it must embrace these requirements and provide the basis for these applications to be supplied by MUMPS application providers.

Finally, systems are now being implemented whose users will be accessing them in their own language and storing data which may subsequently be displayed and manipulated by users in a different language. This is an emerging requirement which will, with the increasing integration of systems across Europe and the world, become more and more important in the future.

MUMPS has the opportunity to provide leadership in the programming language community if it can define features that will enable it to process multiple character sets. The timing is also appropriate, since there are a number of language features under discussion which facilitate multi-lingual capabilities without adding a complex array of new language features.
Furthermore, since MUMPS has not defined a character in terms of the number of bits in which it is stored, it is anticipated that the mechanisms provided within MUMPS for addressing these problems will be significantly simpler than in other languages.

2.2 Existing Practice

There are several versions of MUMPS which can accommodate character sets other than ASCII. All of these versions, however, are bilingual in nature, such as English-Japanese, English-Chinese and English-Russian.

Extensions in these implementations have been adopted to permit string operators ($FIND, $LENGTH, $EXTRACT, $PIECE etc) to deal with logical character lengths. In some of these systems the extensions accommodate character sets in which a single character is represented by more than one byte, in general with escape sequences prefixing character set changes. In other systems 8-bit bytes are used, but the upper 128 characters are normally significantly different from those of US systems, and even from those of other non-US systems with the same MUMPS implementation.

Extensions to permit pattern match operators specific to some other languages have also been adopted for some languages. String subscript collation has also been extended to accommodate 8-bit and two-byte character codes.

These extensions are neither a part of the MUMPS standard, nor are they truly multi-lingual. There is no uniformity in these extensions, and they do not include all the features necessary for multi-lingual processing.

Finally, they typically provide an explicit mechanism for addressing many of the issues involved in multi-lingual processing. A mainly implicit mechanism would greatly improve the availability of multi-lingual applications and also minimise the errors introduced by the additional programming requirements.

Existing practice on non-MUMPS systems is diverse. On the IBM PC each language has its own code page, together with a keyboard mapping.
This gives each language a unique set of 8-bit codes and makes data interchange problematic. DEC have defined a multi-national character set which operates across all their hardware platforms. This makes interchange easy, but creates a unique character set which does not encompass all requirements, eg Japanese. The issues of collation do not appear to have been satisfactorily addressed in these two environments.

Unix systems are now beginning to address the problems, with a lot of work being undertaken in POSIX and X/Open on these issues. Some of this is referenced later in this document.

2.3 Goals

The main objective of the current work is to produce a framework in which application providers can write applications which are portable between systems using different languages and which can also support multiple languages simultaneously. This encompasses not only the display of data entry screens in the terminal user's own language, but also the storage and manipulation of data which may be entered in one of several languages.

It is also hoped that this will lead to the ability to interchange multi-lingual data between systems. It is anticipated that this will initially be between homogeneous systems, but ultimately the goal is to be able to communicate between heterogeneous MUMPS systems and even with non-MUMPS systems.

The objective in this set of proposals is limited to the ability to process multi-lingual information. It does not address all the issues related to internationalisation as it is understood in other systems. For example, POSIX defines a locale which includes a whole host of other issues, eg currency symbol, date format etc. A list of these issues is presented in Appendix A. These will be addressed in other papers and in other ways within the MUMPS language.
One of the design goals is also to enable, as far as possible, existing applications to operate in a language environment different from the one in which they were created, or even in a multi-lingual one. If this is achievable then a major step forward will be taken in reducing the cost of software written for these languages and also in increasing the potential user base of existing applications at minimal cost.

The other major design goal is to minimise the impact on the structure and syntax of the language in providing these capabilities. This is important in ensuring that no major additional learning time is needed in order to use the new facilities.

Finally, it is a design aim to minimise the total implementation effort and performance impact of these changes. The proposed changes will cause modifications to many if not all existing MUMPS implementations. Although this particular design aim is not an overriding criterion, it will obviously be beneficial to all parties if the effort involved and the impact on existing systems can be minimised.

3. PROBLEM DEFINITION

Character Sets

MUMPS as it exists today recognises only a single character entity. There is no statement within the standard as to the number of bits which represent a character. The definition was carefully phrased so as to be vague on this issue, since MUMPS handles these kinds of issues at a conceptual (logical) level and not a physical one.

The only requirement on all Standard MUMPS systems is that the $CHARACTER function, when applied to a number between 0 and 127, produces the equivalent character from the ASCII character set (X3.4-1984), with $ASCII being the reverse function, producing the numeric value corresponding to a character.

The implication of this requirement is that portable programs can only rely on 7-bit ASCII characters being available in any system.
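
As a rough illustration of this requirement, the following sketch is written in Python rather than MUMPS. It assumes that Python's chr() and ord() can stand in for $CHARACTER and $ASCII, and it uses the Python codec names cp437 and latin_1 purely as examples of two 8-bit code pages; neither name comes from this document.

```python
# Illustrative analogy only: chr()/ord() stand in for $CHARACTER/$ASCII.

def portable_round_trip() -> bool:
    """Check that codes 0-127 round-trip and map onto single ASCII bytes."""
    return all(
        ord(chr(code)) == code and chr(code).encode("ascii") == bytes([code])
        for code in range(128)
    )


def decode_byte(value: int, code_page: str) -> str:
    """Interpret one 8-bit code under a named code page (a Python codec name)."""
    return bytes([value]).decode(code_page)


# A code below 128 means the same character under both example code pages...
same_low = decode_byte(0x41, "cp437") == decode_byte(0x41, "latin_1")

# ...but a code in the upper half (128-255) does not.
same_high = decode_byte(0xE9, "cp437") == decode_byte(0xE9, "latin_1")

print(portable_round_trip(), same_low, same_high)
```

Only the first 128 codes carry an agreed meaning here, which is why a portable program cannot assume anything about the codes above them.
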
As was mentioned earlier, there are a number of different systems using 8-bit characters which use characters 128-255 differently.

It is important to distinguish a number of different usages of character sets for the purposes of further discussion:

1.  MUMPS Representation
    This is what was discussed above. It corresponds to the character that is presented to the application program by the underlying MUMPS implementation. There is no restriction as to its length in bits, or currently as to its interpretation above the 7-bit limit.

2.  Internal representation
    This is the physical storage mechanism for a character within the implementor's system. It is conceivable that this might be different for locals and globals, depending on the different costs involved for CPU, memory, disk storage and I/O time.
    One issue that is sometimes raised, especially in other programming languages, is that if the character set to be used in 1) above is 16 or 32 bits wide then the storage needed will automatically be doubled or quadrupled. This need not be true, as the internal representation is the implementor's decision. A number of papers have been presented discussing encoding schemes for multi-octet characters in order to reduce the storage requirements. One possible scheme is to use the top bit of a byte to indicate that the next byte is also part of the character. This leads to a storage of one byte for characters 0-127, two bytes for 128-16383 etc.

3.  Display representation
    This is the physical character appearing on a display (output) device. Strictly speaking, in MUMPS terms this corresponds to the code number which is sent to the device to cause a character to be displayed.

4.  Interchange representation
    This is a character or code number which is used for interchange between systems. Both the sending and receiving systems need to agree on the specific meaning of a code.
    In general terms there is no difference between this representation and the display representation; however, display representations will be driven by specific devices and interchange representations by standards definitions.

From now on the discussion will focus on the MUMPS, or logical, representation. When any other representation is being discussed this will be specifically identified.

The handling of the external representation of characters is not part of the scope of work of this task group; however, a number of issues are obvious and some possible solutions which address the needs of natural languages are proposed. Clearly the issues of font, character size and other attributes need to be solved for the English language. If these general issues are solved then the handling of characters in natural languages will automatically be solved.

Collation

For natural languages the collating sequence of data is extremely important and one of the most difficult issues. It is obvious that with a different character set the standard collation sequence that we have traditionally assumed, namely the one implied by the ASCII collating sequence, will have to change. What is not clear at first glance is how significant the changes actually need to be.

Those languages that are most different from English in visual appearance, such as Japanese and Chinese, have fundamentally different collating sequences. For example, in Japanese the main ordering in most dictionaries has been by the number of strokes used in drawing the character. This ordering is not reflected in the order of characters in any Japanese computer coded character set. In some ways this is simpler than the collation required for many Indo-European languages.
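
The gap between code order and natural-language order noted above can be seen even in a small case. The sketch below is in Python rather than MUMPS, and the word list is an invented example; it simply sorts strings by character code, which is what collation by the ASCII sequence amounts to.

```python
# Invented sample data; "\u00e9" is a lower-case e with an acute accent.
words = ["zebra", "Apple", "\u00e9cole", "apple"]

# Sorting by raw character code, as a code-order collating sequence would.
by_code = sorted(words)

# The code number of each word's first character drives that ordering.
first_codes = {word: ord(word[0]) for word in words}

print(by_code)      # ['Apple', 'apple', 'zebra', 'école']
print(first_codes)
```

All upper-case letters sort before all lower-case ones, and the accented word lands after "zebra", whereas a French dictionary would place it among the words beginning with e.
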
Some years ago it was thought by many of the non-American participants within the MUMPS Development Committee that the usage of diacritical characters (accented ones) could be handled by a relatively simple transform with two usages of the $TRANSLATE function (see X11/SC1/???). However, recent papers by LaBonté and van Wingen within the ISO programming language group working on international character sets have shown that this is not true. It appears that we need at least a four-pass transformation in order to be able to handle French. (It should also be noted at this point that the papers presented within this forum have shown that the collation required for French as used in Canada is different from that in France, and similarly for German in Germany and Switzerland.)

With this new information it is clear that our earlier simplistic notions are no longer able to produce the required results. Even looking at dictionaries and telephone directories in English shows a number of anomalies which cannot be easily rectified. Furthermore, since we cannot be experts in every single language in the world, if we are to solve the problem properly then we must generalise our notion of collation and call on a more general algorithmic approach.

It is therefore recommended that collation be handled by specifying an appropriate algorithm for each language. This will calculate the unique collation value for a string. (The ISO work is aimed at being able to compare two strings and can therefore be specified differently. We, however, have a different requirement, since we are assuming that the values will be used as subscripts for storage in globals.)

It should not be assumed that this algorithm is reversible, ie that given a collation value the original input can be reconstructed. An example from English should suffice to demonstrate this - if we required collation that was totally case insensitive we would transform A and a both to a.
We would then be unable to decide from the collation value a whether the character in the original string was A or a.

The issue of collation affects MUMPS in two particular ways - subscript ordering (both for globals and locals), with the appropriate impact on $ORDER and $QUERY, and the follows operator ]. Most of the discussion has been focussed on the storage issue; however, a full solution also needs to address how these operators are to be affected.

There are a number of reasons for using an implicit collation scheme rather than an explicit scheme. The explicit one would require each application programme to use some kind of function whenever a data reference is made. This would mean changes to every existing programme and it would also be error prone. Furthermore, because of the issues that have been discussed above, it would require at least a four-level scheme, thereby increasing the maximum size requirement for a subscript by a factor of four, ie from the current 128 to 512, and also requiring a doubling of the current string limit. Since the number of levels required for every language is not known at this time, it is still not clear that this would address every possible language.

Pattern Matching

It is clear from the consideration of most languages that the basic pattern match codes defined in the MUMPS standard are insufficient and inadequate. These codes may be grouped into three classes - numeric (N), alphabetic (AULP) and other (CE). For the purposes of multi-lingual handling none of the definitions for any of these categories is acceptable, even though at first sight only the alphabetic ones might be affected.

Numeric

This definition only refers to the 10 ASCII numeric characters. In some character sets numbers may appear in duplicated positions, although this is obviously not a good idea. The usage of ISO 10646 would obviate this problem.

Alphabetic

These reference upper-case and lower-case as being 26 characters each.
Even for languages that have a similar usage to English this is inadequate, since the number of characters in a case varies from language to language. Furthermore, the number of characters in upper-case could be different from that in lower-case. For example, in Modern French accents are not normally used on upper-case letters.

Note  It may still be a good idea to have an upper-case equivalent to each lower-case character, so that the conversion from lower to upper to lower again does not lose information. However, this would cause other problems with comparing strings for equality.

In Japanese two further alphabets are used (Hiragana and Katakana) in addition to the English one and the ideograms (Kanji). Similarly, some East European languages use Cyrillic in addition to English. It therefore makes sense that additional pattern codes be available to match these particular requirements.

The pattern match P is described as punctuation. These characters also vary from language to language; for example, in Spanish ¿ is used in addition to ? to bracket a question.

Other

The issue of control characters and every character, pattern codes C and E, is less important, but these codes probably need redefinition in relation to any chosen character set.

Display Issues

The issues of entering and displaying multi-lingual data are, strictly speaking, not in the scope of this particular task group of SC1 of the MDC. However, a number of issues are raised which point to potential solutions which fit into the scheme of modifications to MUMPS proposed here.

The main reason for handling data entry and display separately is that modern technology has given much greater control over character and data attributes, eg fonts, character sizes and highlighting. It thus seems that a more general solution is required, to ensure that any extensions proposed for multi-lingual handling do not prevent further extensions for these other features.
Handling natural language data may mean pressing several keys on the keyboard in order to enter a single character of data, eg accented characters being entered on a keyboard which does not provide them as a single key stroke. It may also involve outputting several characters, possibly including backspaces and partial line up/down movements, in order to produce an acceptable printed image. Furthermore, the order of characters may change their display representation, as in Arabic, and the size may vary, as in the Japanese usage of Kanji/Roman characters.

-- Hokey We are Space Guys. We know what we are doing.