hokey%plus5@plus5.UUCP (09/21/85)
Distribution:
Volume-Issue: 2.3
The issue of collating and pattern matching needs to be addressed when
Mumps exists in an environment which is anything other than 7 bit ASCII.
The most extreme example is EBCDIC. While it will be useful to have a
7 bit ASCII emulation mode on an EBCDIC machine, there is also a need
to operate Mumps in the native character set.
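To make the collating concern concrete, here is a small sketch in Python
(not Mumps) that sorts the same strings by their native character codes
under ASCII and under one EBCDIC code page. The helper name
sort_by_native_code is made up for illustration, and cp037 is just one
EBCDIC variant picked as an assumption. It models plain code-order
comparison only, not the full Mumps subscript collation.

```python
def sort_by_native_code(strings, codec):
    """Sort strings by the byte codes they have in the given character set."""
    # Comparing the encoded bytes is the same as comparing native codes
    # character by character, left to right.
    return sorted(strings, key=lambda s: s.encode(codec))


sample = ["a", "A", "1"]
ascii_order = sort_by_native_code(sample, "ascii")    # 7 bit ASCII codes
ebcdic_order = sort_by_native_code(sample, "cp037")   # assumed EBCDIC variant
```

Under ASCII the digit sorts first and the lower-case letter last; under
cp037 the order is exactly reversed, so any code which depends on the
relative position of digits and letters behaves differently.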
I would like to see the requirements for pattern match codes, $C()/$A()
mapping, and collating sequence tailored to fit the environment, in order
to provide implementors and users as much latitude as possible.
This can best be done by specifying the behavior of pattern match codes,
$C()/$A(), and collating sequences on a per-character-set basis, in
addition to providing an overall, general specification.
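As a rough illustration of what a per-character-set $C()/$A() mapping
means, the following Python sketch returns the native code of a character
and the character for a native code. The names native_a and native_c are
hypothetical stand-ins for $A() and $C(), and cp037 is again only an
assumed example of an EBCDIC code page.

```python
def native_a(ch, codec):
    """Stand-in for $A(): the native integer code of a single character."""
    return ch.encode(codec)[0]


def native_c(code, codec):
    """Stand-in for $C(): the character whose native code is `code`."""
    return bytes([code]).decode(codec)


a_in_ascii = native_a("A", "ascii")    # 65
a_in_ebcdic = native_a("A", "cp037")   # 193
```

A program which writes $C(65) and expects "A" is therefore tied to one
character set, which is the kind of dependency a per-character-set
specification would have to spell out.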
Two other languages have already done this very thing: MAINSAIL and C.
MAINSAIL (MAchine INdependent Stanford Artificial Intelligence Language)
has this to say:
2.2 CHARACTER SET
MAINSAIL does not specify the exact character set; instead, only the
following is guaranteed:
1) A unique character corresponds to each of the following characters:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789
! " # $ & ' ( ) * + , - . / : ; < = > ? [ ] ^ (uparrow) _ (backarrow)
space (blank)
tab (horizontal tab)
eol (end-of-line: one or two characters)
eop (end-of-page)
Of course MAINSAIL cannot guarantee the graphics associated
with each character, but they should be chosen to approximate
those above, which are from the (1963) ASCII character set.
The graphics for the "^" (uparrow) and "_" (backarrow) characters
were changed in the 1968 ASCII standard to be circumflex and underline,
respectively. MAINSAIL allows "**" to be used in place of "^"
(the exponentiation operator), and ":=" in place of "_" (the
assignment operator).
2) Associated with each character is an integer code. These
character codes range from 0 to n, where n is at least 127.
3) A...Z are alphabetically ordered, but not necessarily contiguous.
4) a...z are alphabetically ordered, but not necessarily contiguous.
5) 0...9 are numerically ordered and contiguous.
Aside from functions which test for uppercase/lowercase/alpha characters,
MAINSAIL also supplies prevAlpha(i) and nextAlpha(i), which do the obvious
things when given b...zB...Z and a...yA...Y, respectively.
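Functions of this kind matter because guarantees 3) and 4) promise order
but not contiguity, so "code + 1" is not a safe way to reach the next
letter. The Python sketch below shows the idea on native codes; the names
next_alpha and prev_alpha only imitate the MAINSAIL ones, the cp037 code
page is an assumed EBCDIC example, and raising an error outside the
stated ranges is my own assumption, not documented MAINSAIL behavior.

```python
import string


def _step_alpha(code, codec, delta):
    """Move `delta` letters along the alphabet, working on native codes."""
    ch = bytes([code]).decode(codec)
    for letters in (string.ascii_uppercase, string.ascii_lowercase):
        i = letters.find(ch)
        if i != -1 and 0 <= i + delta < len(letters):
            return letters[i + delta].encode(codec)[0]
    raise ValueError("no neighbouring letter for code %d" % code)


def next_alpha(code, codec):
    """Native code of the letter after `code` (defined for a...y, A...Y)."""
    return _step_alpha(code, codec, +1)


def prev_alpha(code, codec):
    """Native code of the letter before `code` (defined for b...z, B...Z)."""
    return _step_alpha(code, codec, -1)
```

In cp037, "I" is 0xC9 but "J" is 0xD1, so next_alpha(0xC9, "cp037") has to
skip a gap of seven non-letter codes, while the ASCII case is a simple
increment.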
The proposed C Standard (ANSI X3J11/85-008) says:
The following characters are required in the source character set:
the 52 upper-case and lower-case characters of the English alphabet;
the 10 decimal digits; the following 29 graphic characters:
!"#%&'()*+,-./:;<=>?[\]^_{|}~
the space character, and control characters representing horizontal
tab, vertical tab, and form feed.
Functions exist to test if a given character is Alpha, Numeric, Control,
Printable, Punctuation (any printing character except SPACE, digit, or
letter), and several combinations of these types. I was unable to find
any information regarding either ordering or relative positioning of
characters.
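The appeal of such test functions is that a program asks "is this a
letter?" rather than comparing against code ranges. A minimal Python
sketch of that style follows. It is not the C library itself: classify is
a hypothetical helper, it leans on Python's own string methods, and it is
only meant for the basic letters, digits, and punctuation discussed here,
with cp037 as an assumed EBCDIC example.

```python
def classify(code, codec):
    """Return a rough character class for a native character code."""
    ch = bytes([code]).decode(codec)
    if ch.isdigit():
        return "numeric"
    if ch.isalpha():
        return "alpha"
    if ch == " ":
        return "space"
    if not ch.isprintable():
        return "control"
    # Any printing character except space, digit, or letter.
    return "punctuation"
```

The same call gives the same answer for "A" whether it arrives as 0x41 in
ASCII or 0xC1 in cp037, which is the property one would want from
environment-specific pattern match codes as well.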
Let's "open the doors" in the Standard so that it accommodates IBM
(EBCDIC) environments and non-English languages in a way which
maximizes usability.