[comp.lang.modula2] Module String for ISO

aplusl@ethz.UUCP (Albert Meier) (05/31/88)

Expires: 
Sender: 
Reply-To: aplusl@bernina.UUCP (Albert Meier)
Followup-To: 
Distribution: comp.lang.modula2
Organization: ETH Zuerich, Switzerland
Keywords: Modula-2, Library, Strings

Proposal of the Standard Strings Module
---------------------------------------
 
Module Strings provides several procedures for string manipulation.
 
DEFINITION MODULE String;
  CONST
    first = 0;
    last  = -1;
  PROCEDURE Length (VAR str : ARRAY OF CHAR): INTEGER;
  PROCEDURE Pos    (VAR str : ARRAY OF CHAR; from : INTEGER; token :
                    ARRAY OF CHAR; caseSens : BOOLEAN): INTEGER;
  PROCEDURE Insert (VAR str : ARRAY OF CHAR; at : INTEGER; token :
                    ARRAY OF CHAR);
  PROCEDURE Delete (VAR str : ARRAY OF CHAR; start, length : INTEGER);
  PROCEDURE Copy   (VAR str : ARRAY OF CHAR; source : ARRAY OF CHAR;
                    start, length : INTEGER);
  PROCEDURE Diff   (VAR str : ARRAY OF CHAR; start, length : INTEGER;
                    token : ARRAY OF CHAR; CaseSens : BOOLEAN): INTEGER;
 PROCEDURE Append (VAR front : ARRAY OF CHAR; tail : ARRAY OF CHAR);
  END String.
 
Function Length returns the number of characters in str. A terminating
null character is not counted.
 
Function Occurs searches for the occurrence of token in str, starting
with the character at position from. The resulting value corresponds
to the position of the first matched character, or to last if token
does not occur in str. If parameter caseSens is true, upper and lower
case characters are considered as not identical, otherwise they are
considered as identical. Exception to this are all characters with an
ordinal value greater than 127.
 
Procedure Insert inserts string token into string str to the left of
the character at position at. If the resulting string is too long to
be stored, it will be truncated accordingly. If at > Length(str), the
appropriate number of spaces will first be appended to string str. If
at = last, token will be appended to str.
 
Delete deletes length characters from str starting at from. The
subsequent characters close up to fill the gap.
 
Copy copies a substring from source to str. The substring starts at
start and is length characters long. Characters in excess of the last
character in source are ignored. If there is not enough space, Copy
truncates the target string in the same way as Insert.
 
Compare compares a substring of str starting at start and of length
length with the string in token. If the result is negative, token is
less than the substring, if it is zero they are identical, and if it
is positive token is greater than the substring. The absolute value of
the result gives the character position at which the strings differ,
with the first character having the value one.
 
NB:    Parameter str in Length, Occurs and Compare is a variable
       parameter for reasons of efficiency. The string is not altered.
       There is no point in using these functions with string
       constants, since the results are already known.
 
 
 SSS   N   N  V   V      SNV       SWISS STANDARDIZATION BODY
S      NN  N  V   V      149/UK2   Programming Languages
 SSS   N N N  V   V      Chairman  Albert Meier, CH-8906 Bonstetten
    S  N  NN   V V       .    Tel  +41/1/700 30 37
 SSS   N   N    V        . E-mail  aplusl@komsys.ifi.ethz.ch.UUCP

alan@pdn.UUCP (Alan Lovejoy) (06/01/88)

In article <457@ethz.UUCP> aplusl@ethz.UUCP (Albert Meier) writes:
>Reply-To: aplusl@bernina.UUCP (Albert Meier)
>Distribution: comp.lang.modula2
>Organization: ETH Zuerich, Switzerland
>Keywords: Modula-2, Library, Strings

>Proposal of [for?] the [a?] Standard Strings Module
>---------------------------------------

>DEFINITION MODULE String;
>  CONST
>    first = 0;
>    last  = -1;
>  PROCEDURE Length (VAR str : ARRAY OF CHAR): INTEGER;
>  PROCEDURE Pos    (VAR str : ARRAY OF CHAR; from : INTEGER; token :
>                    ARRAY OF CHAR; caseSens : BOOLEAN): INTEGER;
>  PROCEDURE Insert (VAR str : ARRAY OF CHAR; at : INTEGER; token :
>                    ARRAY OF CHAR);
>  PROCEDURE Delete (VAR str : ARRAY OF CHAR; start, length : INTEGER);
>  PROCEDURE Copy   (VAR str : ARRAY OF CHAR; source : ARRAY OF CHAR;
>                    start, length : INTEGER);
>  PROCEDURE Diff   (VAR str : ARRAY OF CHAR; start, length : INTEGER;
>                    token : ARRAY OF CHAR; CaseSens : BOOLEAN): INTEGER;
> PROCEDURE Append (VAR front : ARRAY OF CHAR; tail : ARRAY OF CHAR);
>  END String.

You are using open array parameters.  Unfortunately, open arrays always
have their first index as 0, and the type of the index is considered to
be CARDINAL.  Or perhaps LONGCARD.  Or perhaps either.  I have a
definition-only module named "SystemX" which exports a type INDEX,
which is supposed to be the type of open array indices.  Using this 
type for variables that will be used for indexing open arrays provides
some measure of compiler/machine portability.

Of course, only the implementation section of this module actually needs
to index the open array parameters.  If that were the only issue, then
we could leave it at that.

However, the "string" data type is defined in M2 (by Wirth) as any array
of char whose lower bound is 0--which means that the index type is
either CARDINAL or LONGCARD, depending on the value of the upper bound.

Note that the upper bound can exceed either MAX(INTEGER) or
MAX(LONGINT), but that this module uses INTEGER for its index
parameters.  It is therefore possible to declare strings that this
"generic" string module cannot handle.

I have a definition-only module named Char which exports a constant "maxArray" 
which is the maximum upper bound of an array of char whose lower bound is 0. 
It also exports

  TYPE 
    Index = [0..maxArray];
    MaxArray = ARRAY Index OF CHAR.

For generic string operations, it is advisable to use index types
defined in this way for the sake of portability.  When porting code,
most of the work is then simply redefining a few constants declared in a
small set of definition-only modules dedicated to describing the
idiosyncracies of a particular compiler/cpu.

One would like to conclude that the type used to index open arrays and the 
type used to index strings should therefore be compatible.  In other
words, SystemX.INDEX = Char.Index.  Unfortunately, this cannot be
guaranteed because for most compilers INDEX = CARDINAL but           
Char.Index might be either CARDINAL or LONGCARD.  For consistency,
compilers should implement the language such that INDEX = LONGCARD 
if the LONGCARD type is implemented at all.  Otherwise, some strings
cannot be passed as parameters using open arrays.

Therefore, a generic string module should export its own index type:

  TYPE Index = SystemX.INDEX.

The index parameters of the procedures should be declared with this
type, as should the index variables in the implementation module.
This guarantees that indices will not be passed in that cannot be used with
the open array parameters used in the module's procedures.

>Function Occurs searches for the occurrence of token in str, starting
          ^^^^^^
I assume this means "function Pos"?

>with the character at position from. The resulting value corresponds
>to the position of the first matched character, or to last if token
>does not occur in str. If parameter caseSens is true, upper and lower
>case characters are considered as not identical, otherwise they are
>considered as identical. Exception to this are all characters with an
>ordinal value greater than 127.

Why are you assuming that case (in?)sensitivity is important only for
7-bit ASCII?  Why are you assuming that the character set is ASCII, for
that matter?

It would be preferable to have two different string searching
procedures, one of which is case sensitive and one of which is
not.  The reasons are as follows:

1) Simplicity: there is one less parameter.  It is easier to remember
two different procedure names with identical parameter profiles than it
is to remember one procedure with a fourth BOOLEAN parameter ("Gee, does
TRUE mean case-sensitive or case-insensitive...  And is it arg 3 or arg
4?").

2) Economy: a string search procedure which must work both in a case
sensitive and a case insensitive mode necessarily must implement the
search algorithm twice: once for each mode.  Such a procedure must be
almost twice as big (source and object code) as one which only
implements one mode.  Users who only need one mode should not have
to pay the price of the mode they don't need.  Smart linkers (which
are finally becoming common) mean that procedures that are not imported
will not take up space in the executable.

But there is a more important issue:  the interface to this procedure is
not optimised for the way it should be implemented nor for the way it
would often be used.  The most efficient string search algorithm is
known as the Boyer-Moore (it can be found in many books on data
structures).  The Boyer-Moore starts out by constructing a table
based on the characters in the token being searched for.  This means
that there is a significant overhead cost to starting a new search
unless the table for the search token has already been prepared and is
available.  The interface for Pos guarantees that this will not be the
case.  Since most applications search a wide variety of relatively large
texts for a small number of relatively short tokens, this is a significant
problem.  

Therefore, String should define:

  TYPE

    SearchToken; (* an opaque type *)

  PROCEDURE NewToken(str: ARRAY OF CHAR;
		     VAR token: SearchToken);

  PROCEDURE DisposeToken(VAR token: SearchToken);

  PROCEDURE Search(token: SearchToken;
		   VAR searchArea: ARRAY OF CHAR;
		   VAR index: Index;
		   VAR found: BOOLEAN);

  PROCEDURE CaseInsenSearch(token: SearchToken;
		            VAR searchArea: ARRAY OF CHAR;
		            VAR index: Index;
		            VAR found: BOOLEAN);

Some users will have to suffer more procedure call overhead than they
would with Pos, but that is very small relative to the number of machine
cycles necessary to do the actual search (regardless of the algorithm).
But most importantly, it is always best to optimise for the average 
case.  Most searches will be looking for a token that has been looked
for previously.  Users will initialize such SearchTokens once outside
the inner loops.  Even in the case of a text editor, it is probable that
the text is segmented into lines, screens, pages and/or 64k segments
so that a global search for a token will require calling Search multiple
times with the same user-supplied SearchToken.

> 
> SSS   N   N  V   V      SNV       SWISS STANDARDIZATION BODY
>S      NN  N  V   V      149/UK2   Programming Languages
> SSS   N N N  V   V      Chairman  Albert Meier, CH-8906 Bonstetten
>    S  N  NN   V V       .    Tel  +41/1/700 30 37
> SSS   N   N    V        . E-mail  aplusl@komsys.ifi.ethz.ch.UUCP
> 

-- 
Alan Lovejoy; alan@pdn; 813-530-8241; Paradyne Corporation: Largo, Florida.
Disclaimer: Do not confuse my views with the official views of Paradyne
            Corporation (regardless of how confusing those views may be).
Motto: Never put off to run-time what you can do at compile-time!