COK2%UK.AC.DURHAM.MTS@AC.UK (Barry_Cornelius) (10/20/86)
All About Strings
M2WG-N95: Issue 2: 30th April 1986
Barry Cornelius
Department of Computer Science
University of Durham
Durham DH1 3LE United Kingdom
1. Introduction
This paper discusses some of the problems with Modula-2's definition
of strings and introduces the two proposals that were discussed at the
February meeting of the Modula-2 Working Group of the British
Standards Institution. The M2WG would be interested to receive your
comments on these proposals.
2. Strings according to the definitions in the Report
2.1 What is meant by 'string'?
$3 of the Modula-2 Report defines the term 'string' to mean
'a sequence of characters enclosed in quote marks'. Hence the word
'string' is used to refer to a literal such as "FRED". Throughout $2
of this paper 'string' will be used with this meaning.
$3 of the Report also says that a string consisting of n characters is
of type:
ARRAY [ 0 .. n-1 ] OF CHAR
So, a string like "FRED" is of the type:
ARRAY [ 0 .. 3 ] OF CHAR
2.2 Simple uses of 'strings'
By the normal compatibility rules (defined in $6.3 and $9.1 of the
Report), a string like "FRED" is assignment compatible with any
variable of the same type. Hence, the following is allowed:
VAR s : ARRAY [ 0 .. 3 ] OF CHAR;
...
s := "FRED";
However, Modula-2 places restrictions on the types of objects which
may be compared using the relational operators. In fact, $8.2.4 of
the Report says:
the ordering relations apply to the basic types, INTEGER,
CARDINAL, BOOLEAN, CHAR, REAL, to enumerations, and to subrange
types.
Thus, unlike Pascal, Modula-2 does not permit strings to be used in
comparisons. For example:
VAR s : ARRAY [ 0 .. 3 ] OF CHAR;
...
IF s = "FRED" THEN ...
is not permitted. However, it is usual for implementations to include
a module providing operations on strings:
FROM Strings IMPORT CompareStr;
VAR s : ARRAY [ 0 .. 3 ] OF CHAR;
...
IF CompareStr(s, "FRED") = 0 THEN ...
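As an illustration only, such a comparison procedure might be sketched as follows. The Report does not define a Strings module, so the signature and the result convention (-1, 0 or 1) are assumptions, as is the use of InOut from Wirth's book. The sketch treats the end of either array as if it were 0C, which is the convention discussed in $2.3.

```modula-2
MODULE CompareDemo;

FROM InOut IMPORT WriteString, WriteLn;

VAR s : ARRAY [ 0 .. 3 ] OF CHAR;

(* Hypothetical sketch: returns -1, 0 or 1 according to whether  *)
(* a is less than, equal to, or greater than b.                  *)
PROCEDURE CompareStr(a, b : ARRAY OF CHAR) : INTEGER;
  VAR i : CARDINAL; ca, cb : CHAR;
BEGIN
  i := 0;
  LOOP
    (* Running off the end of an array counts as reading 0C. *)
    IF i <= HIGH(a) THEN ca := a[i] ELSE ca := 0C END;
    IF i <= HIGH(b) THEN cb := b[i] ELSE cb := 0C END;
    IF ca < cb THEN RETURN -1
    ELSIF ca > cb THEN RETURN 1
    ELSIF ca = 0C THEN RETURN 0   (* both strings ended together *)
    END;
    INC(i)
  END
END CompareStr;

BEGIN
  s := "FRED";
  IF CompareStr(s, "FRED") = 0 THEN
    WriteString("equal")
  ELSE
    WriteString("different")
  END;
  WriteLn
END CompareDemo.
```

Because the parameters are open arrays, the same procedure accepts both variables and literals of more than one character.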
We will now look at some of the problems that occur with the current
definition of strings.
2.3 Problem 1: Assigning 'short strings' to 'long variables'
The section of the Report concerned with Assignment Statements ($9.1)
gives the following feature of Modula-2:
A string of length f can be assigned to a string variable of
length t > f. In this case, the string value is extended with a
null character (0C).
[The Report actually uses n1 and n2 rather than f and t; f and t mean
'from' and 'to'.] Thus:
VAR s : ARRAY [ 0 .. 3 ] OF CHAR;
...
s := "NW";
is legal and is equivalent to:
VAR s : ARRAY [ 0 .. 3 ] OF CHAR;
...
s[0] := "N";
s[1] := "W";
s[2] := 0C;
Note that the Report leaves the value of s[3] undefined.
There are three points to be made about this feature of
Modula-2.
Firstly, this feature permits array assignments that leave
component(s) of the array undefined. This appears very strange to
those of us who believe in the approach taken by the BS/ISO/ANSI/IEEE
Pascal standard. Formulating a rigorous definition in which (components
of) variables are permitted to have undefined values is possible but,
for the reasons spelled out in a paper by Derek Andrews (M2WG-N72),
it would not be very easy.
The second point is that the Report specifies the underlying
representation of a variable that contains a string in detail, and the
representation is not a very convenient one: such a variable is an
array, and the string ends at the first occurrence of 0C or, if there
is no 0C, at the end of the array. It has been argued that this amount
of detail about the representation of strings should not be given, and
that the chosen representation is convenient neither for computers
where strings are normally null-terminated nor for those where the
length of the string is stored.
Finally, the value 0C is inappropriate in some character sets.
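The representation described in the second point means that every string-handling procedure has to test for two end conditions. A minimal sketch of a length function written under that convention is given below. Length is a hypothetical name, not something defined by the Report, and InOut is assumed to be available as in Wirth's book.

```modula-2
MODULE LengthDemo;

FROM InOut IMPORT WriteCard, WriteLn;

VAR s : ARRAY [ 0 .. 3 ] OF CHAR;

(* Hypothetical sketch: the string ends at the first 0C or, *)
(* if there is no 0C, at the end of the array.              *)
PROCEDURE Length(str : ARRAY OF CHAR) : CARDINAL;
  VAR i : CARDINAL;
BEGIN
  i := 0;
  (* AND is evaluated conditionally, so str[i] is only *)
  (* inspected when i is a valid index.                *)
  WHILE (i <= HIGH(str)) AND (str[i] # 0C) DO
    INC(i)
  END;
  RETURN i
END Length;

BEGIN
  s := "NW";
  WriteCard(Length(s), 1); WriteLn;   (* stops at s[2] = 0C *)
  s := "FRED";
  WriteCard(Length(s), 1); WriteLn    (* no 0C: stops at the end of the array *)
END LengthDemo.
```

Note that the sketch never looks at s[3] after the first assignment, which is just as well since the Report leaves its value undefined.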
2.4 Problem 2: What are the types of "A" and ""?
In $3 of the version of the Report given in the first and second
editions of Wirth's book "Programming in Modula-2", there appears the
sentence:
A single-character string is of type CHAR, a string consisting of
n>1 characters is of type
ARRAY [ 0 .. n-1 ] OF CHAR
This means that:
s := "ABC";
is legal whereas:
s := "A";
is not, if s is declared to be of type ARRAY [ 0 .. 3 ] OF CHAR,
because "A" is of type CHAR rather than an array type. And, although
the syntax allows the literal "", the Report is silent about its type.
However, presumably:
s := "";
is allowed.
There are similar problems when passing such values to value
parameters. Consider a procedure whose heading is:
PROCEDURE DoOperationsUsingStringValue(str:ARRAY OF CHAR);
Although the calls:
DoOperationsUsingStringValue("FRED");
DoOperationsUsingStringValue("NW");
DoOperationsUsingStringValue("");
are permitted, the following call is not legal:
DoOperationsUsingStringValue("A");
Instead one has to duplicate the code of the string handling procedure
in another procedure (presumably called DoOperationsUsingCharValue).
In the third edition of Wirth's book, the sentence from the Report
quoted above has been replaced by:
A string consisting of n characters is of type
ARRAY [ 0 .. n-1 ] OF CHAR
The third edition also includes a new sentence in $9.1:
A string of length 1 is compatible with the type CHAR.
So "A" is now considered to be of the type ARRAY [ 0 .. 0 ] OF CHAR
rather than CHAR. This means that most of the above problems have
been removed. However, this new definition seems to imply that the
type of "" is:
ARRAY [ 0 .. -1 ] OF CHAR
The type of "" is particularly important when "" is passed as an
actual parameter to a value parameter which is an open array
parameter. For example, if "A" is passed to
DoOperationsUsingStringValue, then the value parameter, str, is of
type ARRAY [ 0 .. 0 ] OF CHAR and thus HIGH(str) would produce 0. But
what happens when "" is passed?
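To make the question concrete, here is a small sketch that reports HIGH for a value open array parameter. ShowHigh is a hypothetical procedure and InOut is assumed as in Wirth's book. The values given in the comments follow from the Report's typing rule for literals; an implementation that stores extra characters with a literal could report something different.

```modula-2
MODULE HighDemo;

FROM InOut IMPORT WriteCard, WriteLn;

(* Hypothetical helper: prints the upper bound of its argument. *)
PROCEDURE ShowHigh(str : ARRAY OF CHAR);
BEGIN
  WriteCard(HIGH(str), 1); WriteLn
END ShowHigh;

BEGIN
  ShowHigh("FRED");   (* ARRAY [ 0 .. 3 ] OF CHAR, so HIGH(str) should be 3 *)
  ShowHigh("NW")      (* ARRAY [ 0 .. 1 ] OF CHAR, so HIGH(str) should be 1 *)
  (* ShowHigh("") is the open case: HIGH returns a CARDINAL, *)
  (* and no CARDINAL value can stand for the bound -1.       *)
END HighDemo.
```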
3. The current position of the BSI's Modula-2 Working Group
At the February meeting of the M2WG, there were (at least) two
opposing points of view:
o those who thought the holes in the language definition in the
area of strings could be patched up;
o those who thought that Modula-2's attempt at strings was
appalling and that the language had to have a new predefined type
for strings.
There was no real consensus. Instead, we decided to follow up both
proposals in further detail and then to seek the views of the Modula-2
community. The two proposals are presented in $4 and $5 of this
paper.
4. Proposal 1: Patch up the existing definition
A: If a program assigns a string literal to an array-of-char-
variable and the literal is "shorter" than the variable, then the
variable will be padded THROUGHOUT on the right with the
StringTerminator value.
B: The constant StringTerminator will be defined in a module,
probably SYSTEM. For some implementations, it will have the
value CHR(0).
C: The empty string literal is of type 'empty-char' which is
compatible with the types ARRAY [ 0 .. ^^^ ] OF CHAR.
D: A string literal of length n (n>0) is of type
ARRAY [ 0 .. n-1 ] OF CHAR.
E: A string literal of length 1 is compatible with the type CHAR.
F: If a formal parameter is a value parameter which is an open array
parameter and the actual parameter is a string literal then the
formal parameter will have:
o the same type as the actual parameter if the string literal
is not empty;
o the type ARRAY [ 0 .. 0 ] OF CHAR and the value of the
element will be StringTerminator if the actual parameter is
an empty string literal.
G: There will be a module in the Standard library which will provide
a comprehensive set of procedures for operations on string
objects.
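The effect of items A and B can be modelled in today's Modula-2 by a procedure such as the one sketched below. This is only an illustration: under the proposal the padding would be part of the assignment statement itself, AssignPadded is a hypothetical name, and StringTerminator is declared locally with the value 0C purely as an assumption.

```modula-2
MODULE PadDemo;

FROM InOut IMPORT WriteCard, WriteLn;

CONST StringTerminator = 0C;   (* assumption: item B leaves the value open *)

VAR s : ARRAY [ 0 .. 3 ] OF CHAR;
    i : CARDINAL;

(* Hypothetical model of item A: copy src into dst, then fill *)
(* ALL remaining components of dst with StringTerminator.     *)
(* A src longer than dst is silently truncated in this sketch. *)
PROCEDURE AssignPadded(src : ARRAY OF CHAR; VAR dst : ARRAY OF CHAR);
  VAR i : CARDINAL;
BEGIN
  i := 0;
  WHILE (i <= HIGH(src)) AND (i <= HIGH(dst))
        AND (src[i] # StringTerminator) DO
    dst[i] := src[i];
    INC(i)
  END;
  WHILE i <= HIGH(dst) DO
    dst[i] := StringTerminator;
    INC(i)
  END
END AssignPadded;

BEGIN
  AssignPadded("NW", s);
  (* Print the ordinal value of every component of s. *)
  FOR i := 0 TO 3 DO
    WriteCard(ORD(s[i]), 4)
  END;
  WriteLn
END PadDemo.
```

With an ASCII character set this should print 78, 87, 0 and 0. In contrast to the current rule in $9.1 of the Report, no component of s is left undefined.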
5. Proposal 2: Add a new predefined string type
A: A new string type will be added to the language. In essence,
this will be similar to that available in UCSD Pascal or in
Atholl Hay's proposal for ISO Standard Extended Pascal
(M2WG-N65).
B: The Modula-2 language will contain ways (e.g., operators or
standard procedures) to provide the operations of:
assignment,
relational operators,
string <--> char type transfers,
string <--> character array type transfers,
denotation for the empty string,
denotation for string literals,
operation to return the length of a string,
operation to concatenate two strings,
operation to return a substring from a string,
operation to find the start of a substring within a string.
C: Other operations will be provided by a module in the Standard
library since such procedures may be written in terms of the
language.
D: If possible, this new string type's definition will be similar to
that being provided in ISO Standard Extended Pascal.
6. Postscript
Please send your comments on these two proposals to me as soon as
possible. You may find the following electronic mail addresses
useful:
Barry_Cornelius%mts.durham.ac.uk@UCL-CS.ARPA
Barry_Cornelius@uk.ac.durham.mts
bjc@uk.ac.nott.cs