aplusl@ethz.UUCP (Albert Meier) (05/31/88)
Expires: Sender: Reply-To: aplusl@bernina.UUCP (Albert Meier) Followup-To: Distribution: comp.lang.modula2 Organization: ETH Zuerich, Switzerland Keywords: Modula-2, Library, Strings Proposal of the Standard Strings Module --------------------------------------- Module Strings provides several procedures for string manipulation. DEFINITION MODULE String; CONST first = 0; last = -1; PROCEDURE Length (VAR str : ARRAY OF CHAR): INTEGER; PROCEDURE Pos (VAR str : ARRAY OF CHAR; from : INTEGER; token : ARRAY OF CHAR; caseSens : BOOLEAN): INTEGER; PROCEDURE Insert (VAR str : ARRAY OF CHAR; at : INTEGER; token : ARRAY OF CHAR); PROCEDURE Delete (VAR str : ARRAY OF CHAR; start, length : INTEGER); PROCEDURE Copy (VAR str : ARRAY OF CHAR; source : ARRAY OF CHAR; start, length : INTEGER); PROCEDURE Diff (VAR str : ARRAY OF CHAR; start, length : INTEGER; token : ARRAY OF CHAR; CaseSens : BOOLEAN): INTEGER; PROCEDURE Append (VAR front : ARRAY OF CHAR; tail : ARRAY OF CHAR); END String. Function Length returns the number of characters in str. A terminating null character is not counted. Function Occurs searches for the occurrence of token in str, starting with the character at position from. The resulting value corresponds to the position of the first matched character, or to last if token does not occur in str. If parameter caseSens is true, upper and lower case characters are considered as not identical, otherwise they are considered as identical. Exception to this are all characters with an ordinal value greater than 127. Procedure Insert inserts string token into string str to the left of the character at position at. If the resulting string is too long to be stored, it will be truncated accordingly. If at > Length(str), the appropriate number of spaces will first be appended to string str. If at = last, token will be appended to str. Delete deletes length characters from str starting at from. The subsequent characters close up to fill the gap. Copy copies a substring from source to str. The substring starts at start and is length characters long. Characters in excess of the last character in source are ignored. If there is not enough space, Copy truncates the target string in the same way as Insert. Compare compares a substring of str starting at start and of length length with the string in token. If the result is negative, token is less than the substring, if it is zero they are identical, and if it is positive token is greater than the substring. The absolute value of the result gives the character position at which the strings differ, with the first character having the value one. NB: Parameter str in Length, Occurs and Compare is a variable parameter for reasons of efficiency. The string is not altered. There is no point in using these functions with string constants, since the results are already known. SSS N N V V SNV SWISS STANDARDIZATION BODY S NN N V V 149/UK2 Programming Languages SSS N N N V V Chairman Albert Meier, CH-8906 Bonstetten S N NN V V . Tel +41/1/700 30 37 SSS N N V . E-mail aplusl@komsys.ifi.ethz.ch.UUCP
alan@pdn.UUCP (Alan Lovejoy) (06/01/88)
In article <457@ethz.UUCP> aplusl@ethz.UUCP (Albert Meier) writes: >Reply-To: aplusl@bernina.UUCP (Albert Meier) >Distribution: comp.lang.modula2 >Organization: ETH Zuerich, Switzerland >Keywords: Modula-2, Library, Strings >Proposal of [for?] the [a?] Standard Strings Module >--------------------------------------- >DEFINITION MODULE String; > CONST > first = 0; > last = -1; > PROCEDURE Length (VAR str : ARRAY OF CHAR): INTEGER; > PROCEDURE Pos (VAR str : ARRAY OF CHAR; from : INTEGER; token : > ARRAY OF CHAR; caseSens : BOOLEAN): INTEGER; > PROCEDURE Insert (VAR str : ARRAY OF CHAR; at : INTEGER; token : > ARRAY OF CHAR); > PROCEDURE Delete (VAR str : ARRAY OF CHAR; start, length : INTEGER); > PROCEDURE Copy (VAR str : ARRAY OF CHAR; source : ARRAY OF CHAR; > start, length : INTEGER); > PROCEDURE Diff (VAR str : ARRAY OF CHAR; start, length : INTEGER; > token : ARRAY OF CHAR; CaseSens : BOOLEAN): INTEGER; > PROCEDURE Append (VAR front : ARRAY OF CHAR; tail : ARRAY OF CHAR); > END String. You are using open array parameters. Unfortunately, open arrays always have their first index as 0, and the type of the index is considered to be CARDINAL. Or perhaps LONGCARD. Or perhaps either. I have a definition-only module named "SystemX" which exports a type INDEX, which is supposed to be the type of open array indices. Using this type for variables that will be used for indexing open arrays provides some measure of compiler/machine portability. Of course, only the implementation section of this module actually needs to index the open array parameters. If that were the only issue, then we could leave it at that. However, the "string" data type is defined in M2 (by Wirth) as any array of char whose lower bound is 0--which means that the index type is either CARDINAL or LONGCARD, depending on the value of the upper bound. Note that the upper bound can exceed either MAX(INTEGER) or MAX(LONGINT), but that this module uses INTEGER for its index parameters. It is therefore possible to declare strings that this "generic" string module cannot handle. I have a definition-only module named Char which exports a constant "maxArray" which is the maximum upper bound of an array of char whose lower bound is 0. It also exports TYPE Index = [0..maxArray]; MaxArray = ARRAY Index OF CHAR. For generic string operations, it is advisable to use index types defined in this way for the sake of portability. When porting code, most of the work is then simply redefining a few constants declared in a small set of definition-only modules dedicated to describing the idiosyncracies of a particular compiler/cpu. One would like to conclude that the type used to index open arrays and the type used to index strings should therefore be compatible. In other words, SystemX.INDEX = Char.Index. Unfortunately, this cannot be guaranteed because for most compilers INDEX = CARDINAL but Char.Index might be either CARDINAL or LONGCARD. For consistency, compilers should implement the language such that INDEX = LONGCARD if the LONGCARD type is implemented at all. Otherwise, some strings cannot be passed as parameters using open arrays. Therefore, a generic string module should export its own index type: TYPE Index = SystemX.INDEX. The index parameters of the procedures should be declared with this type, as should the index variables in the implementation module. This guarantees that indices will not be passed in that cannot be used with the open array parameters used in the module's procedures. >Function Occurs searches for the occurrence of token in str, starting ^^^^^^ I assume this means "function Pos"? >with the character at position from. The resulting value corresponds >to the position of the first matched character, or to last if token >does not occur in str. If parameter caseSens is true, upper and lower >case characters are considered as not identical, otherwise they are >considered as identical. Exception to this are all characters with an >ordinal value greater than 127. Why are you assuming that case (in?)sensitivity is important only for 7-bit ASCII? Why are you assuming that the character set is ASCII, for that matter? It would be preferable to have two different string searching procedures, one of which is case sensitive and one of which is not. The reasons are as follows: 1) Simplicity: there is one less parameter. It is easier to remember two different procedure names with identical parameter profiles than it is to remember one procedure with a fourth BOOLEAN parameter ("Gee, does TRUE mean case-sensitive or case-insensitive... And is it arg 3 or arg 4?"). 2) Economy: a string search procedure which must work both in a case sensitive and a case insensitive mode necessarily must implement the search algorithm twice: once for each mode. Such a procedure must be almost twice as big (source and object code) as one which only implements one mode. Users who only need one mode should not have to pay the price of the mode they don't need. Smart linkers (which are finally becoming common) mean that procedures that are not imported will not take up space in the executable. But there is a more important issue: the interface to this procedure is not optimised for the way it should be implemented nor for the way it would often be used. The most efficient string search algorithm is known as the Boyer-Moore (it can be found in many books on data structures). The Boyer-Moore starts out by constructing a table based on the characters in the token being searched for. This means that there is a significant overhead cost to starting a new search unless the table for the search token has already been prepared and is available. The interface for Pos guarantees that this will not be the case. Since most applications search a wide variety of relatively large texts for a small number of relatively short tokens, this is a significant problem. Therefore, String should define: TYPE SearchToken; (* an opaque type *) PROCEDURE NewToken(str: ARRAY OF CHAR; VAR token: SearchToken); PROCEDURE DisposeToken(VAR token: SearchToken); PROCEDURE Search(token: SearchToken; VAR searchArea: ARRAY OF CHAR; VAR index: Index; VAR found: BOOLEAN); PROCEDURE CaseInsenSearch(token: SearchToken; VAR searchArea: ARRAY OF CHAR; VAR index: Index; VAR found: BOOLEAN); Some users will have to suffer more procedure call overhead than they would with Pos, but that is very small relative to the number of machine cycles necessary to do the actual search (regardless of the algorithm). But most importantly, it is always best to optimise for the average case. Most searches will be looking for a token that has been looked for previously. Users will initialize such SearchTokens once outside the inner loops. Even in the case of a text editor, it is probable that the text is segmented into lines, screens, pages and/or 64k segments so that a global search for a token will require calling Search multiple times with the same user-supplied SearchToken. > > SSS N N V V SNV SWISS STANDARDIZATION BODY >S NN N V V 149/UK2 Programming Languages > SSS N N N V V Chairman Albert Meier, CH-8906 Bonstetten > S N NN V V . Tel +41/1/700 30 37 > SSS N N V . E-mail aplusl@komsys.ifi.ethz.ch.UUCP > -- Alan Lovejoy; alan@pdn; 813-530-8241; Paradyne Corporation: Largo, Florida. Disclaimer: Do not confuse my views with the official views of Paradyne Corporation (regardless of how confusing those views may be). Motto: Never put off to run-time what you can do at compile-time!