[net.lang.mod2] All About Strings

COK2%UK.AC.DURHAM.MTS@AC.UK (Barry_Cornelius) (10/20/86)

                          All About Strings

                  M2WG-N95: Issue 2: 30th April 1986


                           Barry Cornelius

                    Department of Computer Science
                         University of Durham
                   Durham  DH1 3LE  United Kingdom



1. Introduction

This paper discusses some of the problems with Modula-2's definition
of strings and introduces the two proposals that were discussed at the
February meeting of the Modula-2 Working Group of the British
Standards Institution.  The M2WG would be interested to receive your
comments on these proposals.


2. Strings according to the definitions in the Report


2.1 What is meant by 'string'?

$3 of the Modula-2 Report defines the term 'string' to mean
'a sequence of characters enclosed in quote marks'.  Hence the word
'string' is used to refer to a literal such as "FRED".  Throughout $2
of this paper 'string' will be used with this meaning.

$3 of the Report also says that a string consisting of n characters is
of type:

     ARRAY [ 0 .. n-1 ] OF CHAR

So, a string like "FRED" is of the type:

     ARRAY [ 0 .. 3 ] OF CHAR



2.2 Simple uses of 'strings'

By the normal compatibility rules (defined in $6.3 and $9.1 of the
Report), a string like "FRED" is assignment compatible with any
variable which is also of its type.  Hence, the following is allowed:

     VAR s : ARRAY [ 0 .. 3 ] OF CHAR;
     ...
     s := "FRED";

However, Modula-2 places restrictions on the types of objects which
may be compared using the relational operators.  In fact, $8.2.4 of
the Report says:

     the ordering relations apply to the basic types, INTEGER,
     CARDINAL, BOOLEAN, CHAR, REAL, to enumerations, and to subrange
     types.

Thus, unlike Pascal, Modula-2 does not permit strings to be used in
comparisons.  For example:

     VAR s : ARRAY [ 0 .. 3 ] OF CHAR;
     ...
     IF s = "FRED" THEN ...

is not permitted.  However, implementations usually include a library
module providing operations on strings, for example:

     FROM Strings IMPORT CompareStr;
     VAR s : ARRAY [ 0 .. 3 ] OF CHAR;
     ...
     IF CompareStr(s, "FRED") = 0 THEN ...
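
The name and result convention of such a procedure differ between
implementations.  As a rough sketch of what one might do, assuming
that a string ends at the first 0C or at the end of the array (see
$2.3), an equality test could be written as follows.  The name Equal
is hypothetical and is not taken from any particular library:

     PROCEDURE Equal(a, b : ARRAY OF CHAR) : BOOLEAN;
       VAR i : CARDINAL;  ca, cb : CHAR;
     BEGIN
       i := 0;
       LOOP
         (* treat the end of either array as an implicit 0C *)
         IF i <= HIGH(a) THEN ca := a[i] ELSE ca := 0C END;
         IF i <= HIGH(b) THEN cb := b[i] ELSE cb := 0C END;
         IF ca # cb THEN RETURN FALSE END;
         IF ca = 0C THEN RETURN TRUE END;   (* both strings ended *)
         INC(i)
       END
     END Equal;

With this sketch the test above would be written
IF Equal(s, "FRED") THEN ...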


We will now look at some of the problems that occur with the current
definition of strings.




2.3 Problem 1: Assigning 'short strings' to 'long variables'

The section of the Report concerned with Assignment Statements ($9.1)
describes the following feature of Modula-2:

     A string of length f can be assigned to a string variable of
     length t > f.  In this case, the string value is extended with a
     null character (0C).

[The Report actually uses n1 and n2 rather than f and t; f and t mean
'from' and 'to'.]  Thus:

     VAR s : ARRAY [ 0 .. 3 ] OF CHAR;
     ...
     s := "NW";

is legal and is equivalent to:

     VAR s : ARRAY [ 0 .. 3 ] OF CHAR;
     ...
     s[0] := "N";
     s[1] := "W";
     s[2] := 0C;

Note that the Report leaves the value of s[3] undefined.

There are three points to be made about this feature of
Modula-2.

Firstly, this feature permits array assignments that leave
component(s) of the array undefined.  This appears very strange to
those of us who believe in the approach taken by the BS/ISO/ANSI/IEEE
Pascal standard.  Formulating a rigorous definition in which
(components of) variables are permitted to have undefined values is
possible but, for the reasons spelled out in a paper by Derek Andrews
(M2WG-N72), it would not be easy.

The second point is that the Report specifies the underlying
representation of a string variable in detail, and the representation
is not a convenient one: the variable is an array, and the string ends
at the first occurrence of 0C or, if there is no 0C, at the end of the
array.  It has been argued that the language definition should not
give this much detail, and that the chosen representation is
convenient neither for computers where strings are normally
null-terminated nor for those where the
length of the string is stored.
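
To illustrate the inconvenience, here is a minimal sketch of a
procedure that finds the length of a string held in this
representation.  The name Length is an assumption made for the
example.  Note that two tests are needed at every step, one for the
end of the array and one for 0C:

     PROCEDURE Length(str : ARRAY OF CHAR) : CARDINAL;
       VAR i : CARDINAL;
     BEGIN
       i := 0;
       (* & is evaluated left to right, so str[i] is only *)
       (* inspected when i is a valid index               *)
       WHILE (i <= HIGH(str)) & (str[i] # 0C) DO
         INC(i)
       END;
       RETURN i
     END Length;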

Finally, the value 0C is inappropriate in some character sets.


2.4 Problem 2: What are the types of "A" and ""?

In $3 of the version of the Report given in the first and second
editions of Wirth's book "Programming in Modula-2", there appears the
sentence:

     A single-character string is of type CHAR, a string consisting of
     n>1 characters is of type
          ARRAY [ 0 .. n-1 ] OF CHAR

This means that:

     s := "ABC";

is legal whereas:

     s := "A";

is not, if s is declared to be of type ARRAY [ 0 .. 3 ] OF CHAR.
Moreover, although the syntax allows the literal "", the Report is
silent about its type.  However, presumably:

     s := "";

is allowed.

There are similar problems when passing such values to value
parameters.  Consider a procedure whose heading is:

     PROCEDURE DoOperationsUsingStringValue(str:ARRAY OF CHAR);

Although the calls:

     DoOperationsUsingStringValue("FRED");
     DoOperationsUsingStringValue("NW");
     DoOperationsUsingStringValue("");

are permitted, it is not legal to do:

     DoOperationsUsingStringValue("A");

Instead one has to duplicate the code of the string handling procedure
in another procedure (presumably called DoOperationsUsingCharValue).
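
One way to limit the duplication, sketched here under the rules of the
first and second editions, is to make the CHAR version a thin wrapper.
It copies its argument into a one-element array variable, which may be
passed to an open array parameter.  The body shown is only an
illustration:

     PROCEDURE DoOperationsUsingCharValue(ch : CHAR);
       VAR tmp : ARRAY [ 0 .. 0 ] OF CHAR;
     BEGIN
       tmp[0] := ch;                     (* a string of length 1 *)
       DoOperationsUsingStringValue(tmp)
     END DoOperationsUsingCharValue;

The caller must still choose between the two procedures according to
the length of the literal, so the inconvenience remains.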

In the third edition of Wirth's book, the sentence from the Report
quoted above has been replaced by:

     A string consisting of n characters is of type
          ARRAY [ 0 .. n-1 ] OF CHAR

The third edition also includes a new sentence in $9.1:

     A string of length 1 is compatible with the type CHAR.

So "A" is now considered to be of the type ARRAY [ 0 .. 0 ] OF CHAR
rather than CHAR.  This means that most of the above problems have
been removed.  However, this new definition seems to imply that the
type of "" is:

     ARRAY [ 0 .. -1 ] OF CHAR

The type of "" is particularly important when "" is passed as an
actual parameter to a value parameter which is an open array
parameter.  For example, if "A" is passed to
DoOperationsUsingStringValue, then the value parameter, str, is of
type ARRAY [ 0 .. 0 ] OF CHAR and thus HIGH(str) would produce 0.  But
what happens when "" is passed?





3. The current position of the BSI's Modula-2 Working Group

At the February meeting of the M2WG, there were (at least) two
opposing points of view:

o    those who thought the holes in the language definition in the
     area of strings could be patched up;

o    those who thought that Modula-2's attempt at strings was
     appalling and that the language had to have a new predefined type
     for strings.

There was no real consensus.  Instead, we decided to follow up both
proposals in further detail and then to seek the views of the Modula-2
community.  The two proposals are presented in $4 and $5 of this
paper.


4. Proposal 1: Patch up the existing definition

  A: If a program assigns a string literal to an array-of-char
     variable and the literal is "shorter" than the variable, then the
     variable will be padded THROUGHOUT on the right with the
     StringTerminator value.

  B: The constant StringTerminator will be defined in a module,
     probably SYSTEM.  For some implementations, it will have the
     value CHR(0).

  C: The empty string literal is of type 'empty-char' which is
     compatible with the types ARRAY [ 0 .. ^^^ ] OF CHAR.

  D: A string literal of length n (n>0) is of type
     ARRAY [ 0 .. n-1 ] OF CHAR.

  E: A string literal of length 1 is compatible with the type CHAR.

  F: If a formal parameter is a value parameter which is an open array
     parameter and the actual parameter is a string literal then the
     formal parameter will have:

     o    the same type as the actual parameter if the string literal
          is not empty;

     o    the type ARRAY [ 0 .. 0 ] OF CHAR and the value of the
          element will be StringTerminator if the actual parameter is
          an empty string literal.

  G: There will be a module in the Standard library which will provide
     a comprehensive set of procedures for operations on string
     objects.
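
The effect of rule A can be sketched as a procedure that behaves like
the proposed assignment.  This is an illustration only: the name
AssignPadded is hypothetical, and StringTerminator is declared locally
with the value 0C, whereas rule B would import it from a module.

     CONST StringTerminator = 0C;   (* assumed value, see rule B *)

     PROCEDURE AssignPadded(src : ARRAY OF CHAR;
                            VAR dst : ARRAY OF CHAR);
       VAR i : CARDINAL;
     BEGIN
       i := 0;
       (* copy the characters of src, stopping at a terminator *)
       WHILE (i <= HIGH(src)) & (i <= HIGH(dst))
             & (src[i] # StringTerminator) DO
         dst[i] := src[i];
         INC(i)
       END;
       (* rule A: fill every remaining element, not just the first *)
       WHILE i <= HIGH(dst) DO
         dst[i] := StringTerminator;
         INC(i)
       END
     END AssignPadded;

Under rule F, the call AssignPadded("", s) receives a one-element
array holding StringTerminator, so the first loop copies nothing and
every element of s is set to StringTerminator.  The sketch silently
truncates a source that is longer than the destination; it does not
attempt to model how the assignment statement itself treats that case.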


5. Proposal 2: Add a new predefined string type

  A: A new string type will be added to the language.  In essence,
     this will be similar to that available in UCSD Pascal or in
     Atholl Hay's proposal for ISO Standard Extended Pascal
     (M2WG-N65).

  B: The Modula-2 language will contain ways (e.g., operators or
     standard procedures) to provide the operations of:
          assignment,
          relational operators,
          string <--> char type transfers,
          string <--> character array type transfers,
          denotation for the empty string,
          denotation for string literals,
          operation to return the length of a string,
          operation to concatenate two strings,
          operation to return a substring from a string,
          operation to find the start of a substring within a string.

  C: Other operations will be provided by a module in the Standard
     library since such procedures may be written in terms of the
     language.

  D: If possible, this new string type's definition will be similar to
     that being provided in ISO Standard Extended Pascal.
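
The proposal does not fix a representation or a syntax for the new
type.  Purely as an illustration of the UCSD Pascal style, in which
the current length is stored alongside the characters, such a type
could be modelled in the existing language as a record.  The names
String, MaxLen, Length and Concat below are hypothetical:

     CONST MaxLen = 80;   (* hypothetical capacity *)

     TYPE String = RECORD
                     len   : CARDINAL;  (* characters in use *)
                     chars : ARRAY [ 0 .. MaxLen-1 ] OF CHAR
                   END;

     PROCEDURE Length(s : String) : CARDINAL;
     BEGIN
       RETURN s.len       (* no scan for a terminator is needed *)
     END Length;

     PROCEDURE Concat(a, b : String; VAR result : String);
       VAR i : CARDINAL;
     BEGIN
       result := a;
       i := 0;
       (* append b; characters beyond MaxLen are dropped *)
       WHILE (i < b.len) & (result.len < MaxLen) DO
         result.chars[result.len] := b.chars[i];
         INC(result.len);
         INC(i)
       END
     END Concat;

With a stored length, no character value needs to be reserved as a
terminator.  A predefined type could go further than this record, for
example by providing literals and relational operators, which a
library module cannot add.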


6. Postscript

Please send your comments on these two proposals to me as soon as
possible.  You may find the following electronic mail addresses
useful:

     Barry_Cornelius%mts.durham.ac.uk@UCL-CS.ARPA
     Barry_Cornelius@uk.ac.durham.mts
     bjc@uk.ac.nott.cs