COK2%UK.AC.DURHAM.MTS@AC.UK (Barry_Cornelius) (10/20/86)
All About Strings

M2WG-N95: Issue 2: 30th April 1986

Barry Cornelius
Department of Computer Science
University of Durham
Durham DH1 3LE
United Kingdom

1. Introduction

This paper discusses some of the problems with Modula-2's definition
of strings and introduces the two proposals that were discussed at the
February meeting of the Modula-2 Working Group of the British
Standards Institution. The M2WG would be interested to receive your
comments on these proposals.

2. Strings according to the definitions in the Report

2.1 What is meant by 'string'?

$3 of the Modula-2 Report defines the term 'string' to mean 'a
sequence of characters enclosed in quote marks'. Hence the word
'string' is used to refer to a literal such as "FRED". Throughout $2
of this paper 'string' will be used with this meaning.

$3 of the Report also says that a string consisting of n characters is
of type:

    ARRAY [ 0 .. n-1 ] OF CHAR

So, a string like "FRED" is of the type:

    ARRAY [ 0 .. 3 ] OF CHAR

2.2 Simple uses of 'strings'

By the normal compatibility rules (defined in $6.3 and $9.1 of the
Report), a string like "FRED" is assignment compatible with any
variable which is also of its type. Hence, the following is allowed:

    VAR s : ARRAY [ 0 .. 3 ] OF CHAR;
    ...
    s := "FRED";

However, Modula-2 places restrictions on the types of objects which
may be compared using the relational operators. In fact, $8.2.4 of the
Report says:

    the ordering relations apply to the basic types, INTEGER,
    CARDINAL, BOOLEAN, CHAR, REAL, to enumerations, and to subrange
    types.

Thus, unlike Pascal, Modula-2 does not permit strings to be used in
comparisons. For example:

    VAR s : ARRAY [ 0 .. 3 ] OF CHAR;
    ...
    IF s = "FRED" THEN ...

is not permitted. However, it is usual for implementations to include
a module providing operations on strings:

    FROM Strings IMPORT CompareStr;
    VAR s : ARRAY [ 0 .. 3 ] OF CHAR;
    ...
    IF CompareStr(s, "FRED") = 0 THEN ...

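
A procedure such as CompareStr can be written in Modula-2 itself. The
following is only a minimal sketch and not the text of any particular
implementation's Strings module. The module name CompareSketch, the
result convention (-1, 0 or 1) and the use of InOut for output are
assumptions made for illustration. The sketch also assumes that a
string held in an array ends at the first 0C or at the end of the
array, which is the representation discussed in $2.3.

```modula2
MODULE CompareSketch;

FROM InOut IMPORT WriteInt, WriteLn;

(* Hypothetical stand-in for a library CompareStr.  Returns -1, 0 or 1
   according to whether a sorts before, equal to, or after b.
   Assumption: a string ends at the first 0C or at the end of its
   array, whichever comes first. *)
PROCEDURE CompareStr(a, b : ARRAY OF CHAR) : INTEGER;
  VAR i : CARDINAL;
      ca, cb : CHAR;
BEGIN
  i := 0;
  LOOP
    (* Positions past the end of an array are treated as 0C. *)
    IF i <= HIGH(a) THEN ca := a[i] ELSE ca := 0C END;
    IF i <= HIGH(b) THEN cb := b[i] ELSE cb := 0C END;
    IF ca < cb THEN RETURN -1 END;
    IF ca > cb THEN RETURN 1 END;
    (* ca = cb here; if it is the terminator, both strings have ended. *)
    IF ca = 0C THEN RETURN 0 END;
    INC(i)
  END
END CompareStr;

VAR s : ARRAY [ 0 .. 3 ] OF CHAR;

BEGIN
  s := "FRED";
  (* Equal strings: the result is compared with 0, as in the text. *)
  WriteInt(CompareStr(s, "FRED"), 3); WriteLn;
  (* Strings that differ in their last character. *)
  WriteInt(CompareStr(s, "FREE"), 3); WriteLn
END CompareSketch.
```

Only character comparisons are used inside the procedure, so the
sketch stays within the ordering relations that $8.2.4 of the Report
permits. Both actual parameters above have more than one character,
which avoids the difficulty with single-character literals described
in $2.4.
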
We will now look at some of the problems that occur with the current
definition of strings.

2.3 Problem 1: Assigning 'short strings' to 'long variables'

The section of the Report concerned with Assignment Statements ($9.1)
gives the following feature of Modula-2:

    A string of length f can be assigned to a string variable of
    length t > f. In this case, the string value is extended with a
    null character (0C).

[The Report actually uses n1 and n2 rather than f and t; f and t mean
'from' and 'to'.] Thus:

    VAR s : ARRAY [ 0 .. 3 ] OF CHAR;
    ...
    s := "NW";

is legal and is equivalent to:

    VAR s : ARRAY [ 0 .. 3 ] OF CHAR;
    ...
    s[0] := "N"; s[1] := "W"; s[2] := 0C;

Note that the Report leaves the value of s[3] undefined.

There are three points to be made about this feature of Modula-2.

Firstly, this feature permits array assignments that leave
component(s) of the array undefined. This appears very strange to
those of us who believe in the approach taken by the
BS/ISO/ANSI/IEEE Pascal standard. Formulating a rigorous definition
where (components of) variables are permitted to have undefined values
is possible but, for the reasons spelled out in a paper by Derek
Andrews (M2WG-N72), it would not be very easy.

The second point is that the underlying representation of a variable
that contains a string is given in detail. And it is not a very
convenient representation: a variable that contains a string is an
array and the string ends at the first occurrence of 0C or at the end
of the array if there is no 0C. It has been argued that this amount of
detail about the representation of strings should not be given, and
that the chosen representation is neither convenient for computers
where strings are normally null-terminated nor convenient when the
length of the string is stored.

Finally, the value 0C is inappropriate in some character sets.

2.4 Problem 2: What are the types of "A" and ""?

In $3 of the version of the Report given in the first and second
editions of Wirth's book "Programming in Modula-2", there appears the
sentence:

    A single-character string is of type CHAR, a string consisting of
    n>1 characters is of type ARRAY [ 0 .. n-1 ] OF CHAR

This means that:

    s := "ABC";

is legal whereas:

    s := "A";

is not if s is declared to be of type ARRAY [ 0 .. 3 ] OF CHAR. And,
although the syntax allows the literal "", the Report is silent about
its type. However, presumably:

    s := "";

is allowed.

There are similar problems when passing such values to value
parameters. Consider a procedure whose heading is:

    PROCEDURE DoOperationsUsingStringValue(str:ARRAY OF CHAR);

Although the calls:

    DoOperationsUsingStringValue("FRED");
    DoOperationsUsingStringValue("NW");
    DoOperationsUsingStringValue("");

are permitted, it is not legal to write:

    DoOperationsUsingStringValue("A");

Instead one has to duplicate the code of the string handling procedure
in another procedure (presumably called DoOperationsUsingCharValue).

In the third edition of Wirth's book, the sentence from the Report
quoted above has been replaced by:

    A string consisting of n characters is of type
    ARRAY [ 0 .. n-1 ] OF CHAR

The third edition also includes a new sentence in $9.1:

    A string of length 1 is compatible with the type CHAR.

So "A" is now considered to be of the type ARRAY [ 0 .. 0 ] OF CHAR
rather than CHAR. This means that most of the above problems have been
removed. However, this new definition seems to imply that the type of
"" is:

    ARRAY [ 0 .. -1 ] OF CHAR

The type of "" is particularly important when "" is passed as an
actual parameter to a value parameter which is an open array
parameter. For example, if "A" is passed to
DoOperationsUsingStringValue, then the value parameter, str, is of
type ARRAY [ 0 .. 0 ] OF CHAR and thus HIGH(str) would produce 0. But
what happens when "" is passed?

3. The current position of the BSI's Modula-2 Working Group

At the February meeting of the M2WG, there were (at least) two
opposing points of view:

o  those who thought the holes in the language definition in the area
   of strings could be patched up;

o  those who thought that Modula-2's attempt at strings was appalling
   and that the language had to have a new predefined type for
   strings.

There was no real consensus. Instead, we decided to follow up both
proposals in further detail and then to seek the views of the Modula-2
community. The two proposals are presented in $4 and $5 of this paper.

4. Proposal 1: Patch up the existing definition

A: If a program assigns a string literal to an array-of-char-variable
   and the literal is "shorter" than the variable, then the variable
   will be padded THROUGHOUT on the right with the StringTerminator
   value.

B: The constant StringTerminator will be defined in a module,
   probably SYSTEM. For some implementations, it will have the value
   CHR(0).

C: The empty string literal is of type 'empty-char' which is
   compatible with the types ARRAY [ 0 .. ^^^ ] OF CHAR.

D: A string literal of length n (n>0) is of type
   ARRAY [ 0 .. n-1 ] OF CHAR.

E: A string literal of length 1 is compatible with the type CHAR.

F: If a formal parameter is a value parameter which is an open array
   parameter and the actual parameter is a string literal then the
   formal parameter will have:

   o  the same type as the actual parameter if the string literal is
      not empty;

   o  the type ARRAY [ 0 .. 0 ] OF CHAR and the value of the element
      will be StringTerminator if the actual parameter is an empty
      string literal.

G: There will be a module in the Standard library which will provide a
   comprehensive set of procedures for operations on string objects.

5. Proposal 2: Add a new predefined string type

A: A new string type will be added to the language.
   In essence, this will be similar to that available in UCSD Pascal
   or in Atholl Hay's proposal for ISO Standard Extended Pascal
   (M2WG-N65).

B: The Modula-2 language will contain ways (e.g., operators or
   standard procedures) to provide the operations of:

      assignment,
      relational operators,
      string <--> char type transfers,
      string <--> character array type transfers,
      denotation for the empty string,
      denotation for string literals,
      operation to return the length of a string,
      operation to concatenate two strings,
      operation to return a substring from a string,
      operation to find the start of a substring within a string.

C: Other operations will be provided by a module in the Standard
   library since such procedures may be written in terms of the
   language.

D: If possible, this new string type's definition will be similar to
   that being provided in ISO Standard Extended Pascal.

6. Postscript

Please send your comments on these two proposals to me as soon as
possible. You may find the following electronic mail addresses useful:

    Barry_Cornelius%mts.durham.ac.uk@UCL-CS.ARPA
    Barry_Cornelius@uk.ac.durham.mts
    bjc@uk.ac.nott.cs