[mod.std.mumps] Collating and pattern match outside of 7 bit ASCII

hokey%plus5@plus5.UUCP (09/21/85)

Distribution:
Volume-Issue: 2.3

The issues of collating and pattern matching need to be addressed whenever
Mumps runs in an environment other than 7-bit ASCII.

The most extreme example is EBCDIC.  While it will be useful to have a
7-bit ASCII emulation mode on an EBCDIC machine, there is also a need
to operate Mumps in the native character set.

I would like to see the requirements for pattern match codes, $C()/$A()
mapping, and collating sequence tailored to fit the environment, in order
to provide implementors and users as much latitude as possible.

This can best be done by specifying the behavior of pattern match codes,
$C()/$A(), and collating sequences on a per-character-set basis, as well
as an overall, general specification.
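
To make the per-character-set idea concrete, here is a minimal sketch in C
of a collating table that is consulted instead of the raw character codes.
The names (struct charset, charset_init, charset_compare) are hypothetical
and come from no Mumps implementation; the rule that unlisted characters
sort after the listed ones, in native code order, is likewise only an
assumption made for the example.

```c
#include <stddef.h>

/* Hypothetical per-character-set descriptor: one collating weight
   for every native character code. */
struct charset {
    const char *name;
    unsigned char weight[256];
};

/* Build the weight table from a string listing characters in the
   desired collating order.  Characters not listed are placed after
   the listed ones, in native code order (an assumption made here). */
void charset_init(struct charset *cs, const char *name, const char *order)
{
    unsigned char seen[256] = {0};
    unsigned next = 0;
    size_t i;

    cs->name = name;
    for (i = 0; order[i] != '\0'; i++) {
        unsigned char c = (unsigned char)order[i];
        if (!seen[c]) {
            seen[c] = 1;
            cs->weight[c] = (unsigned char)next++;
        }
    }
    for (i = 0; i < 256; i++) {
        if (!seen[i])
            cs->weight[i] = (unsigned char)next++;
    }
}

/* Compare two strings by collating weight rather than by code.
   Returns a negative, zero, or positive value, like strcmp(). */
int charset_compare(const struct charset *cs, const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        unsigned char wa = cs->weight[(unsigned char)*a];
        unsigned char wb = cs->weight[(unsigned char)*b];
        if (wa != wb)
            return wa < wb ? -1 : 1;
        a++;
        b++;
    }
    if (*a != '\0')
        return 1;   /* a is longer, so it sorts later */
    if (*b != '\0')
        return -1;  /* b is longer */
    return 0;
}
```

With a table built from the order string "aAbBcC",
charset_compare(&cs, "a", "B") is negative on any machine, whereas a
comparison of the raw codes gives opposite answers on ASCII and EBCDIC.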

Two other languages have already done this very thing: MAINSAIL and C.
MAINSAIL (MAchine INdependent Stanford Artificial Intelligence Language)
has this to say:

  2.2 CHARACTER SET

  MAINSAIL does not specify the exact character set; instead, only the
  following is guaranteed:

  1)	A unique character corresponds to each of the following characters:

	ABCDEFGHIJKLMNOPQRSTUVWXYZ
	abcdefghijklmnopqrstuvwxyz
	0123456789
	! " # $ & ' ( ) * + , - . / : ; < = > ? [ ] ^ (uparrow) _ (backarrow)
	space (blank)
	tab (horizontal tab)
	eol (end-of-line: one or two characters)
	eop (end-of-page)

	Of course MAINSAIL cannot guarantee the graphics associated
	with each character, but they should be chosen to approximate
	those above, which are from the (1963) ASCII character set.
	The graphics for the "^" (uparrow) and "_" (backarrow) characters
	were changed in the 1968 ASCII standard to be circumflex and underline,
	respectively.  MAINSAIL allows "**" to be used in place of "^"
	(the exponentiation operator), and ":=" in place of "_" (the
	assignment operator).

  2)	Associated with each character is an integer code.  These
	character codes range from 0 to n, where n is at least 127.

  3)	A...Z are alphabetically ordered, but not necessarily contiguous.

  4)	a...z are alphabetically ordered, but not necessarily contiguous.

  5)	0...9 are numerically ordered and contiguous.

Aside from functions which test for uppercase/lowercase/alpha characters,
MAINSAIL also supplies prevAlpha(i) and nextAlpha(i), which return the
preceding and following letter when given b...z, B...Z and a...y, A...Y,
respectively.
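
Guarantees 3) and 4) say the letters are ordered but not necessarily
contiguous, so nextAlpha cannot simply add one to the code (in EBCDIC the
code after "i" is not "j").  The following is a rough C sketch of how such
functions might be written without assuming contiguity.  The C names
next_alpha and prev_alpha are made up for this example, and returning -1 at
the ends of the alphabet or for non-letters is an assumption; the MAINSAIL
text quoted above does not say what happens there.

```c
#include <string.h>

static const char UPPER[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
static const char LOWER[] = "abcdefghijklmnopqrstuvwxyz";

/* Move delta positions within the alphabet containing c.
   Returns -1 if c is not a letter or the move leaves the alphabet
   (this error convention is an assumption of the sketch). */
static int step_alpha(int c, int delta)
{
    const char *sets[2];
    int s;

    sets[0] = UPPER;
    sets[1] = LOWER;
    if (c == '\0')
        return -1;  /* strchr() would otherwise match the terminator */
    for (s = 0; s < 2; s++) {
        const char *p = strchr(sets[s], c);
        if (p != NULL) {
            long i = (long)(p - sets[s]) + delta;
            if (i < 0 || i > 25)
                return -1;
            return (unsigned char)sets[s][i];
        }
    }
    return -1;
}

/* Following letter of the same case, by table position, not by code. */
int next_alpha(int c) { return step_alpha(c, +1); }

/* Preceding letter of the same case. */
int prev_alpha(int c) { return step_alpha(c, -1); }
```

Because the lookup goes through the alphabet strings rather than through
code arithmetic, next_alpha('i') yields 'j' on ASCII and EBCDIC alike.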

The proposed C Standard (ANSI X3J11/85-008) says:

	The following characters are required in the source character set:
	the 52 upper-case and lower-case characters of the English alphabet;
	the 10 decimal digits; the following 29 graphic characters:
	    !"#%&'()*+,-./:;<=>?[\]^_{|}~
	the space character, and control characters representing horizontal
	tab, vertical tab, and form feed.

Functions exist to test if a given character is Alpha, Numeric, Control,
Printable, Punctuation (any printing character except SPACE, digit, or
letter), and several combinations of these types.  I was unable to find
any information regarding either ordering or relative positioning of
characters.
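
As an illustration of how classification functions of this kind could
stand behind the Mumps pattern match codes, here is a small C sketch built
on the <ctype.h> tests.  The function name pattern_code_matches is
hypothetical, and so is the mapping itself; in particular, counting SPACE
under the P code (which ispunct() does not do) is my assumption about how
P would be carried over.

```c
#include <ctype.h>

/* Return 1 if character ch belongs to the class named by a Mumps
   pattern code letter, else 0.  The classes come from the host's
   <ctype.h>, so no assumption is made about code values or ordering. */
int pattern_code_matches(char code, int ch)
{
    unsigned char c = (unsigned char)ch;

    switch (code) {
    case 'A': return isalpha(c) != 0;             /* alphabetic */
    case 'C': return iscntrl(c) != 0;             /* control */
    case 'E': return 1;                           /* everything */
    case 'L': return islower(c) != 0;             /* lowercase */
    case 'N': return isdigit(c) != 0;             /* numeric */
    case 'P': return ispunct(c) != 0 || c == ' '; /* punctuation; SPACE
                                                     included by assumption */
    case 'U': return isupper(c) != 0;             /* uppercase */
    default:  return 0;                           /* unknown code */
    }
}
```

A pattern such as X?1U could then test its single character with
pattern_code_matches('U', ch), and the same source would behave sensibly
in whatever character set the host's <ctype.h> describes.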

Let's "open the doors" in the Standard to accommodate IBM (EBCDIC)
environments and non-English languages in a way which maximizes usability.