[comp.std.mumps] Pattern match

DIAMOND.JON%forum.va.gov (04/11/91)

Pattern Definition Proposal        X11/SC1/TG1/91-2
Thursday, April 4, 1991
 
1.  IDENTIFICATION
 
1.1  Title
 
Pattern definitions
 
1.2  MDC proposer and sponsor
 
Proposer:  Jon Diamond, Hoskyns Group, 130 Shaftesbury Avenue,
 
2.  JUSTIFICATION
 
2.1  Needs
 
The major need is to be able to define (or re-define) pattern 
codes for processing text in non-English languages. Currently 
there is no mechanism for specifying or extending pattern 
matches. See separate document for further information.
 
2.2  Existing practice in the area of the proposed change
 
As far as is known, no current implementation allows any 
user definition of patterns, or any extensibility beyond the 
patterns defined by the implementation.
 
 
3.  DESCRIPTION
 
3.1  General description of the proposed change
 
There are three parts to this proposal. The first provides 
access to extended pattern capabilities by using structured 
system variables to define the meaning of patcodes explicitly.
 
The second extends the definition of patatom in a similar 
fashion to the way glvn has been extended to allow an 
environment specification. This will allow applications to 
select between different sets of pattern match tables, in a 
similar fashion to the way a global is selected from a 
different environment in networking. This will allow 
switching between different languages' definitions of the 
same patcodes.
 
The third allows an application to force usage of the 
patcodes in the base ASCII set that we currently use, 
overriding the current default set, without having to place a 
reserved name in an environment specification. This will 
allow programs to verify that, for example, entered 
characters are alphabetic in all possible character sets 
used.
 
3.2  Annotated examples of use
 
The current pattern codes can be set up in the following 
fashion (for 7-bit characters), subject to access control:-
 
KILL ^$PATTERN
FOR I=0:1:31,127 SET ^$PATTERN("C","MEMBER",I)=""
FOR I=32:1:47 SET ^$PATTERN("P","MEMBER",I)=""
FOR I=48:1:57 SET ^$PATTERN("N","MEMBER",I)=""
FOR I=58:1:64 SET ^$PATTERN("P","MEMBER",I)=""
FOR I=65:1:90 SET ^$PATTERN("U","MEMBER",I)=""
FOR I=91:1:96 SET ^$PATTERN("P","MEMBER",I)=""
FOR I=97:1:122 SET ^$PATTERN("L","MEMBER",I)=""
FOR I=123:1:126 SET ^$PATTERN("P","MEMBER",I)=""
MERGE ^$PATTERN("A")=^$PATTERN("U"),^$PATTERN("A")=^$PATTERN("L")
MERGE ^$PATTERN("E")=^$PATTERN("C"),^$PATTERN("E")=^$PATTERN("P")
MERGE ^$PATTERN("E")=^$PATTERN("A"),^$PATTERN("E")=^$PATTERN("N")
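The effect of this set-up can be modelled outside M. The Python sketch below is illustrative only: it represents ^$PATTERN as a dictionary from patcode to a set of character codes (keyed by code, as in the formalization in 3.3), and build_ascii_tables is a hypothetical helper name, not part of the proposal.

```python
def build_ascii_tables():
    """Return a dict mapping each patcode to the set of 7-bit codes it matches."""
    tables = {
        "C": set(range(0, 32)) | {127},          # control characters
        "N": set(range(48, 58)),                 # digits
        "U": set(range(65, 91)),                 # upper case letters
        "L": set(range(97, 123)),                # lower case letters
        "P": (set(range(32, 48)) | set(range(58, 65))
              | set(range(91, 97)) | set(range(123, 127))),  # punctuation
    }
    # A and E are unions of other classes, mirroring the MERGE commands.
    tables["A"] = tables["U"] | tables["L"]
    tables["E"] = tables["C"] | tables["P"] | tables["A"] | tables["N"]
    return tables


tables = build_ascii_tables()
print(len(tables["E"]))  # 128: E covers every 7-bit character
```

Using sets makes the MERGE steps plain unions; a real implementation would of course be free to store the tables differently.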
 
To add additional characters to these pattern codes for other 
languages would require coding like:-
 
SET ^$PATTERN("U","MEMBER",$A(""))=""
 
Programs would then perform pattern matches in exactly the 
same way as they do now and get the expected results, e.g. 
the following:-
 
SET A=""
WRITE A?1U
 
would currently write 0, and would write 1 once the above SET 
has taken place.
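A small Python model may make the before-and-after behaviour concrete. It is a sketch under stated assumptions: matches_one is a hypothetical stand-in for a ?1 pattern match with a single patcode, and the character used (code 196 in ISO 8859-1) is chosen purely for illustration, since the proposal's own example character is not reproduced here.

```python
def matches_one(tables, patcode, text):
    """Model of text?1<patcode>: exactly one character, belonging to the class."""
    return len(text) == 1 and ord(text) in tables.get(patcode, set())


# Only the U class is modelled; it starts as the ASCII upper case letters.
tables = {"U": set(range(65, 91))}
a = "\u00c4"  # illustrative non-ASCII character (assumption, see lead-in)

before = matches_one(tables, "U", a)
tables["U"].add(ord(a))  # models SET ^$PATTERN("U","MEMBER",code)=""
after = matches_one(tables, "U", a)

print(int(before), int(after))  # 0 1
```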
 
In an environment which normally runs in English, but also 
has pattern code tables set up for German, the previous 
example would need modifying, since the SET would only apply 
to the ^$PATTERN for German.
 
SET A=""
WRITE A?1U
 
produces 0 in the normal (English) case, but
 
SET A=""
WRITE A?1|"GERMAN"|U
 
would produce the value 1, as expected.
 
Given the logical extension to environments, the German 
table might be set up by, say,
 
KILL ^|"GERMAN"|$PATTERN
MERGE ^|"GERMAN"|$PATTERN=^$PATTERN
SET ^|"GERMAN"|$PATTERN("U","MEMBER",$A(""))=""
SET ^|"GERMAN"|$PATTERN("L","MEMBER",$A(""))=""
...
MERGE ^|"GERMAN"|$PATTERN("A")=^|"GERMAN"|$PATTERN("L")
MERGE ^|"GERMAN"|$PATTERN("A")=^|"GERMAN"|$PATTERN("U")
 
and a complete switch to German from English by
 
SET ^$JOB($J,"PATTERN")="GERMAN"
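The interaction between named pattern environments and the per-job default can be sketched in Python as follows. Everything here is an assumption made for illustration: the dictionaries environments and job, the use of "" as the name of the default (English) tables, and the extra character code 196.

```python
import copy

# "" stands for the default (English) tables; only U is modelled.
environments = {"": {"U": set(range(65, 91))}}

# Models KILL / MERGE ^|"GERMAN"|$PATTERN=^$PATTERN, then extending U.
environments["GERMAN"] = copy.deepcopy(environments[""])
environments["GERMAN"]["U"].add(196)  # illustrative extra character

job = {"PATTERN": ""}  # models ^$JOB($J,"PATTERN")


def matches_one(text, patcode, environment=None):
    """Model of text?1|environment|patcode; no environment means the job default."""
    env = job["PATTERN"] if environment is None else environment
    return len(text) == 1 and ord(text) in environments[env].get(patcode, set())


a = chr(196)
default_result = matches_one(a, "U")             # English tables
explicit_result = matches_one(a, "U", "GERMAN")  # explicit environment
job["PATTERN"] = "GERMAN"                        # complete switch to German
switched_result = matches_one(a, "U")

print(int(default_result), int(explicit_result), int(switched_result))  # 0 1 1
```

The deep copy matters: extending the German tables must leave the default tables untouched, just as the MERGE copies rather than aliases.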
 
This last action is analogous to changing the environment for 
networking for globals, which according to current proposals 
would be achieved by
 
SET ^$JOB($J,"GLOBAL")="XYZ"
 
NOTE: Whether any change of environment is possible is an 
access security issue. See X11/SC7/TG1/91-3 for more details.
 
The final change proposed would be used to check whether a 
character was ASCII etc. Therefore if an application was 
running in the German environment, but needed to know whether 
the character was portable to a non-German environment, the 
coding would be:-
 
IF A?1~A
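The intent of the ~ form can be modelled in Python as a lookup that bypasses the current tables. This is only a sketch: BASE_ALPHA, current and the force_base parameter are hypothetical names, and the two extra codes (196 and 228) are illustrative additions to the A class.

```python
# Fixed ASCII definition of A, which ~A always uses.
BASE_ALPHA = set(range(65, 91)) | set(range(97, 123))

# Current (hypothetically extended) definition of A in the active environment.
current = {"A": BASE_ALPHA | {196, 228}}


def matches_one(text, patcode, force_base=False):
    """Model of text?1A (current tables) or text?1~A (base ASCII tables)."""
    table = {"A": BASE_ALPHA} if force_base else current
    return len(text) == 1 and ord(text) in table[patcode]


extended = chr(228)
print(int(matches_one(extended, "A")))                   # 1: current definition
print(int(matches_one(extended, "A", force_base=True)))  # 0: not base ASCII
print(int(matches_one("x", "A", force_base=True)))       # 1: portable character
```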
 
3.3  Formalization
 
1.  Extended pattern code definition
 
In part I section 2.2 add the definition of ^$PATTERN after 
the ^$LOCK structured system variable:
 
^$P[ATTERN]
    will provide information regarding pattern codes and 
    their definition
 
In section 2.3.3 replace the definition of patcode with
 
patcode ::= | alpha | ...
 
At the end of the following paragraph replace 
 
", as follows."
 
with
 
". A character "char" belongs to a patcode class if 
$DATA(^$PATTERN(patcode,"MEMBER",$ASCII(char))) is true. The 
initial values for the following patcodes are defined:"
 
Add another section to Part II
 
x. Pattern codes
 
The only pattern codes that are required to be provided are 
A, C, E, L, N, P, U with the definitions as per section 2.3.3 
of Part I. Portable programs cannot rely on changing the 
default environment specification.
 
Add another section (not sure where to)
 
x. ssvn semantics
 
The following ssvns are defined:-
 
^$PATTERN(patcode,"MEMBER",intexpr) = ""
 
The SET command can be used to assign a value to this ssvn. 
The KILL command can be used to delete individual nodes, sub-
trees or the entire ssvn. The meaning is that implied by 
section 2.3.3.
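The SET, KILL and membership semantics just described can be approximated with a flat dictionary keyed by subscript tuples. The Python below is a sketch, not a definition: pattern, set_member, kill and is_member are hypothetical names, and is_member models the test $DATA(^$PATTERN(patcode,"MEMBER",$ASCII(char))) from section 2.3.3.

```python
pattern = {}  # maps subscript tuples such as ("U", "MEMBER", 65) to ""


def set_member(patcode, code):
    """Model of SET ^$PATTERN(patcode,"MEMBER",code)=""."""
    pattern[(patcode, "MEMBER", code)] = ""


def kill(*prefix):
    """Model of KILL: remove the node at prefix and all of its descendants.

    With no arguments the entire structure is removed.
    """
    for key in [k for k in pattern if k[:len(prefix)] == prefix]:
        del pattern[key]


def is_member(patcode, char):
    """True if char belongs to the patcode class."""
    return (patcode, "MEMBER", ord(char)) in pattern


set_member("U", 65)
set_member("U", 66)
set_member("N", 48)

kill("U", "MEMBER", 65)   # individual node
print(int(is_member("U", "A")), int(is_member("U", "B")))  # 0 1

kill("U")                 # sub-tree
print(int(is_member("U", "B")), int(is_member("N", "0")))  # 0 1

kill()                    # entire ssvn
print(len(pattern))       # 0
```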
 
^$JOB($J,"PATTERN") = default pattern environment 
specification
 
(The remaining text from X11/SC7/TG1/91-3 section 3.3 also 
applies to this entry in ^$JOB.)
 
 
2.  Alternate pattern code access
 
In part I section 2.3.3 replace the definition of patatom 
with
 
                          | [ environment ] patcode |
    patatom ::= repcount  |                         |
                          | strlit                  |
 
See section 3.2.2.2 for a definition of environment.
 
 
3.  ASCII pattern access
 
In part I section 2.3.3 add to the definition of patatom 
another option (after the repcount)
 
                          | ~ patcode               |
 
making the definition of patatom 
 
                          | [ environment ] patcode |
    patatom ::= repcount  | ~ patcode               |
                          | strlit                  |
 
and add a new paragraph 
 
Where the form ~ patcode is used, the characters which 
match the patcode are defined to be those described below, 
irrespective of the current definition of the patcode.
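To show how the three alternatives of the extended patatom might be told apart, here is a minimal Python recogniser. It is a sketch under simplifying assumptions: the environment is restricted to a quoted name between vertical bars (the grammar allows more), and PATATOM and parse_patatom are hypothetical names, not part of the proposal.

```python
import re

PATATOM = re.compile(
    r'(?P<repcount>\d*\.\d*|\d+)'                       # n, n.m, n., .m or .
    r'(?:'
    r'(?:\|"(?P<env>[^"]*)"\|)?(?P<patcode>[A-Za-z]+)'  # [ environment ] patcode
    r'|~(?P<base>[A-Za-z]+)'                            # ~ patcode
    r'|"(?P<strlit>(?:[^"]|"")*)"'                      # strlit
    r')'
)


def parse_patatom(text):
    """Split one patatom into its parts; raise ValueError if it is malformed."""
    m = PATATOM.fullmatch(text)
    if m is None:
        raise ValueError("not a patatom: %r" % text)
    # Keep only the parts that were actually present.
    return {k: v for k, v in m.groupdict().items() if v is not None}


print(parse_patatom('1|"GERMAN"|U'))  # {'repcount': '1', 'env': 'GERMAN', 'patcode': 'U'}
print(parse_patatom('1~A'))           # {'repcount': '1', 'base': 'A'}
print(parse_patatom('3"ab"'))         # {'repcount': '3', 'strlit': 'ab'}
```

Keeping ~ patcode as a separate alternative, rather than treating ~ as a reserved environment name, matches the stated aim of the third part of the proposal.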
 
 
4.  IMPLEMENTATION IMPACTS
 
4.1  Impact on existing user practices and investments
 
Existing applications written for the English language would 
be easier to apply to other languages. No existing 
applications should be affected.
 
4.2  Impact on existing vendor practices and investments
 
The impact on vendors is not insignificant, although some 
vendors have experience with the problems of different 
languages. A new table mechanism will need to be set up 
within implementations to allow for the variability of 
pattern match codes. This will need to be definable in a 
similar way to the existing UCI/directory/namespace concepts 
for globals, modifiable by a (restricted class of) user etc.
 

-- 
Hokey				We are Space Guys.  We know what we are doing.