[net.lang.mod2] A Long Note on the Trichotomous String Comparison Example

PATTIS@WASHINGTON.ARPA (Richard Pattis) (07/09/86)

		 A Book Notice and Problem Statement


A New Modula-2 Book:

  I recently received a copy of a new Modula-2 book.  It was written by Arthur
Sale of the Univeristy of Tasmania in Australia, and it is titled "Modula-2:
Discipline & Design".  It is published by Addison-Wesley and costs about $25.
I'll quote a section from the preface to show what I think is the flavor of
this book (all typos are my own):

    "My purpose in writing this book is to bring together a precise
     discussion of Modula-2 and its features and modern concepts of
     professional large program writing.  There are few books which are
     accessible to learning students which use Modula-2 at the present
     time, and this is an inhibiting factor in its use as a teaching
     language.  However, even more important in this book is the break
     with the traditional approaches to first language texts.  It is my
     experience over ten years that learning students can cope with
     concepts of considerable abstraction, such as predicate transformers
     and invariants.  Not all the implications are realized by students,
     but the concepts can be planted and used.  The best students progress
     much faster, but even average students see the ideas which are presented
     as a result of modern computer science research as converting the dark
     art of programming into something which is understandable.  The era of
     learning programming by osmosis, which has persisted for far too long,
     is ending, and programming has become a subject in which design and
     discipline reign and can be taught"


A Modula-2 Problem:

  In Sale's discussion of strings, he decides to store each string padded with
0Cs (not just one, but 0Cs all the way up thru the last element in the array).
Of course, if the string fills the array, there is no padding at all.  Also,
note that the string terminator, 0C, compares less than any other character.
Given any two parameter strings that follow this convention, Sale writes the
following function to compare them (I've tried to transcribe the code directly
from this introductory text, using the same indentation and spacing):

(1) I assert, that this function computes wrong answers.  In what cases does
    it do so?  Be as general as possible and remember the assumptions above.

(2) Describe how we could rewrite this function to correct this deficiency.

(3) Given the same assumptions, completely rewrite this function to be as
    simple as possible. It should also execute as quickly as possible.

(4) Now, remove the padding assumption; instead, assume that only one 0C is
    placed at the end of each string that doesn't fill its array.  Rewrite
    this function to work correctly with strings following this convention.

(5) What interesting conclusions can you draw from this exercise?

  For parts 3 & 4, use your best programming style.  Write these function as
clearly as possible.

  Please send your answers to me, PATTIS@WASHINGTON.  After you send me your
analysis, I'll send you mine.  I'll also keep a compendium of the analyses
that I receive, and send it out to the mailing list when responses stop.  If
you don't want your analysis shown to anyone, or if you don't want your name
attached to it, please tell me when you mail in your analysis.  If there is
enough interest in this exercise, I'll construct a similar one.


  TYPE Order = (Before,Same,After);


  PROCEDURE CompareStrings (S1: ARRAY OF CHAR;  S2 : ARRAY OF CHAR) : Order;
  CONST
    NUL = 00C;  (* Terminating Marker *)
  VAR
    i : CARDINAL;
    Shortest : CARDINAL;
    Result : Order;
    ResultKnown : BOOLEAN;
  BEGIN
    Shortest:= Min(HIGH(S1),HIGH(S2));
    ResultKnown:= FALSE;
    i:= 0;
    WHILE NOT ResultKnown DO

       IF S1[i] < S2[i] THEN
            (* Found unmatched character *)
            Result:= Before;
	    ResultKnown:= TRUE;
       ELSIF S1[i] > S2[i] THEN
            (* Found unmatched character *)
            Result:= After;
	    ResultKnown:= TRUE;
       ELSE (* S1[i] = S2[i] *)
           IF S1[i] = NUL THEN
	        (*   Both strings same up to two NULs. *)
                Result:= Same;
		ResultKnown:= TRUE;
	   ELSE
	        (*   Task unresolved, must look at next component. *)
	       i:= i+1;
	       IF i > Shortest THEN
		      (*   Run out of one string's components. *)
	       	      IF HIGH(S1) = HIGH(S2) THEN (* Both together. *)
		      	   Result:= Same;
	       	      ELSIF Shortest = HIGH(S1) THEN (* S1 shorter *)
		      	   Result:= Before;
	       	      ELSE (* S2 shorter *)
		      	   Result:= After;
		      END;
		      ResultKnown:= TRUE;
	       END;
	   END;
       END (* of IF comparing S[i]'s *);

    END (* of WHILE *);
    RETURN Result;
END CompareStrings;
      Coding & Correctness: A Complex String Processing Example


  Please excuse the way that this note is written; I've been writing so long
for underclassmen, I may have forgotten how to address the experts. I actually
wrote this material for students in my second quarter programming course.


(1) The CompareString function returns an incorrect answer when comparing S1
(a string filling its array) with S2 (a string equal to S1, but stored in a
larger array; e.g., S2 is terminated by at least one 0C).  The same is true
when the properties of S1 and S2 are switched.  For both these inputs, the
CompareString function returns Before (After) instead of Same: the nested test
HIGH(S1) = HIGH(S2) needs to determine whether the two strings are the same
size -  but instead, it only checks whether the two arrays are the same size.

(2) CompareString could be corrected by adding more code to check for special
cases.  This added code would make a complex function even more complicated.
Instead of fixing the "minor" problem here (and who knows what other "minor"
problems would then crop up), I felt that the whole function could instead be
rewritten much more simply. This was the gut feeling that motivated the entire
problem.  When teaching introductory programming, I quote to my students the
following saying: "Bad programmers fix errors by adding corrective code.  Good
programmers fix errors by removing defective code."

(3) When one writes a simpler version of CompareStrings, one finds that the
idea of extra padding is unnecessary,  Only in the original version of the
CompareStrings function does padding simplify coding.  Therefore, see the
next paragraph for a solution to this problem too.

(4) When I teach string processing in Modula-2, I first show my students the
following High function (which is analogous to the built-in HIGH function). It
takes O(N) to execute High - I assume that HIGH is O(1) - but it supplies its
information about a string (as opposed to information about the array in which
the string is stored).

  PROCEDURE High (S : ARRAY OF CHAR)  : INTEGER;
  VAR I : INTEGER;
  BEGIN
    FOR I:=0 TO HIGH(S) DO
      IF S[I] = 0C
        THEN RETURN I-1 END
    END;
    RETURN HIGH(S)
  END High;

  Given this High function, and trichotomous versions of CompareChars and
CompareIntegers, I wrote a much simpler version of CompareStrings. These three
functions don't appear in my final solution, but I used them to help me write
a simple, correct piece of code (which I then had enough confidence in to try
to optimize).

  PROCEDURE CompareStrings (S1, S2 : ARRAY OF CHAR)  : Order;
  VAR I : INTEGER;
  BEGIN
    FOR I:= 0 TO MIN(High(S1),High(S2)) DO
      CASE CompareChars(S1[I],S2[I]) OF
         Before : RETURN Before
       | After  : RETURN After
       | Same   : (* continue scanning *)
      END
    END;
    RETURN CompareIntegers(High(S1),High(S2))
  END CompareStrings;
  My final solution to this problem follows a similar algorithm.  It pushes an
approximate calculation of each High function inside the FOR loop: if a 0C is
reached in either S1 xor S2, then that string is smaller; if 0C is reached in
both S1 and S2, the strings must be equal.  The FOR loop exits (as opposed to
returns) only if all elements thru HIGH of the smaller array are identical to
all equivalent elements in the larger (or equal-sized) array.  The final three
statements handle the special case checking; they determe whether one string
is truly longer than the other. Notice that at most one (and maybe neither) of
the final IF statements may be TRUE, because MinH is >= at least one HIGH.

  PROCEDURE CompareStrings (S1, S2 : ARRAY OF CHAR)  : Order;
  VAR I,MinH : INTEGER;
  BEGIN
    MinH:= MIN(HIGH(S1),HIGH(S2));		  (* Or, expand into an IF  *)
    FOR I:= 0 TO MinH DO
      IF S1[I] < S2[I] THEN RETURN Before END;
      IF S1[I] > S2[I] THEN RETURN After  END;
      IF S1[I] = 0C    THEN RETURN Same   END	  (* If here, S1[I] = S2[I] *)
    END;

    IF (MinH < HIGH(S2) AND (S2[MinH+1] <> 0C) THEN RETURN Before END;
    IF (MinH < HIGH(S1) AND (S1[MinH+1] <> 0C) THEN RETURN After  END;
    RETURN Same
END CompareStrings;

  I'm assuming that a good optimizing compiler will cache the values of S1[I]
and S2[I] inside the loop so that these arrays are actually accessed only once
per loop iteration.  If not, I can always declare two local variables C1 & C2
of type CHAR, and assign them the values of S1[I] and S2[I] respectively as
the first two statements inside the loop.

  Notice that if all strings ended in at least one 0C, then the entire block
of code following the FOR loop could be removed.  In fact, the FOR loop could
be turned into a LOOP incrementing I from 0 upward, avoiding the need for a
MinH variable and any evaluation of HIGH.  Finally, if we never needed to
compare string literals, we could change both the parameter modes to VAR and
avoid the copying (although this advantage may be offset by an increase in the
time needed for each array access).  Further transformations for efficiency
gains may be very dependent on the compiler and machine architecture used.

  There is a slight oxymoron in question (3), where I ask for the simplest and
most efficient code. Often these qualities oppose each other; but for the most
part (ruling out exotic algorithms), the simplest code is (or can often be
made, by something as simple as an optimizing compiler) the most efficient.


(5) On teaching programming: First, I'd like to quote dircectly what Sale says
about the implementation of StringCompare, after showing the code (page 266).

  "The procedure design was kept straightforward and provably correct on the
   grounds that this provides the best illustration for learners...The complex
   IF tests which determine the result when the procedure reaches the end of
   one or both strings is very rarely executed.  (Of course, this makes its
   provable correctness even more important, because it may not be tested
   carefully!)"

I would like to suggest that in programming, the search for simplicity should
take precedence over the search for a "proof of correctness".  Too many times
I have written a complex algorithm and spent time trying to prove it correct
(and sometimes succeeding, even when the algorithm was wrong), rather than
spending my time looking for a simpler algorithm.  This is the standard penny
wise, pound foolish approach to reality.  Unfortunately, when one searches for
a complex proof of correctness, one often finds it too easily and in error. I
would rather find a simple algorithm that is easily explained than a proof.
  For introductory programming classes, I try to emphasize the teaching of
critical thinking.  I attempt teach techniques for trying to prove that code
is incorrect (such an activity may be labeled as aggressive debugging - even
in the absence of manifest bugs).  Mathematically proving and disproving code
are identical, but in training future programmers, I would much prefer that my
students say, "I haven't found anything wrong with my code yet," to the all
too frequently heard, "It's correct." Of course, it's easier to spot errors in
someone else's code, so one good way to teach this skill is to ask students to
analyze code written by others (known to be wrong) and find any errors.  They
will enjoy  this exercise more if the code has been published somewhere.

  As an exerci
se, examine your favorite Pascal book for the code implementing
some foundational algorithms in computer programming: prime testing, binary
array searching, an O(N^2) sort, Quicksort, etc.  Ask yourself first whether
the code is correct; next ask yourself whether the code is presented as simply
and clearly as possible; could you do better using Modula-2?  Remember, these
examples are used to introduce students to programming; they will be studied
by tens of thousands of students and become standards for comparision.


  On formalism: I believe that formalism is useful only when it can be applied
to extend what we can do, or accurately check what we have done. Formalism for
its own sake is pedantry.  For example, I advocate the use of simple control
structures such as Pascal's, which subscribe to the 1-in/1-out rule.  Yet in
Modula-2 we may violate this rule by using LOOP with EXIT and subroutines that
use RETURN.  I use these features to simply implement simple algorithms: often
such code is easier to understand, because it omits the introduction and use
of extraneous state variables and other circuitous control flow.

  Here is an example function that appears in most introduction to programming
books. It performs a linear search in array A for the index of Key; it returns
the first index of Key in A (if such an index exists); otherwise it returns
Default.  I'll first show the Modula-2 solution that I teach.

  PROCEDURE Search1 (VAR A   : ARRAY OF INTEGER;	      (* Efficiency *)
  		     Key     : INTEGER;
		     Default : INTEGER)           : INTEGER;
  VAR I : INTEGER;
  BEGIN 
    FOR I:=0 TO HIGH(A) DO
      IF A[I] = Key	
        THEN RETURN I END		      (* Found first I: A[I] =  Key *)
    END;
    RETURN Default			      (* 0<=I<=HIGH(A): A[I] <> Key *)
  END Search1;


Here is the simplest solution that I know adhering to the 1-in/1-out rule.

  PROCEDURE Search2 (VAR A   : ARRAY OF INTEGER;	      (* Efficiency *)
  		     Key     : INTEGER;
		     Default : INTEGER)           : INTEGER;
  VAR I,AnswerIndex : INTEGER;
  BEGIN 
    I:= 0;
    WHILE (I <= HIGH(A)) AND (A[I] <> Key) DO
      I:= I+1
    END;
    IF I <= HIGH(A)			 (* Determine why WHILE loop exited *)
      THEN AnswerIndex:= I		 (* Found first I: A[I]  = Key      *)
      ELSE AnswerIndex:= Default	 (* 0<=I<=HIGH(A): A[I] <> Key      *)
      END;
    RETURN AnswerIndex
  END Search2;
  Some comments on these solutions: Search1 uses two RETURN statements, one in
the FOR loop. Many programmers consider the premature RETURN from the FOR loop
to be bad programming practice.  Search2 has one complex test instead of two
simpler ones (actually, Search1 hides one test in the FOR's semantics).  The
two tests specify under what condition the loop should continue, the first
specifies that the array end has not been reach, the second specifies that the
Key has not been found; I find both these "negative" tests difficult to under-
stand (is there a better solution using REPEAT?). In Search2, we must remember
to include code to initialize, increment, and test the loop index. In Search2,
we must disambiguate the loop exit condition in a seperate IF statement after
exiting the loop; even in this IF we cannot perform the easier-to-understand
"positive" test A[I] = Key (this inability is a direct result of short-circuit
boolean expression evaluation, discussed below).  Finally, in Search2 we need
to introduce an extra state variable AnswerIndex (or worse, reuse the variable
I).  If this second example is upsetting, now remove the ability to perform
short-circuit boolean evaluation (SCBE) and rewrite this function in Pascal.
Some programmers may argue that SCBE should be disallowed for similar reasons
of formalilty, because it doesn't support the intuitive semantics for boolean
expressions (commutivity); in the example above, we must very carefully select
our post-loop IF test.

  OK, I lied: here is a simpler Search2 example.  Other people use a variant
that replaces AnswerIndex with a boolean variable indicating whether the key
element has been found (such a solution must set I to the correct return value
either inside the loop or after it exits).  This style of coding also works in
Pascal.  I still think that Search2 is more complicated and more difficult to
understand than Search1: for example, I think that it is un-intuitive to set
AnswerIndex to Default before the loop code executes.  Also, this code fails
if the first occurrence of Key is in A[Default], and it is followed by another
occurence of Key. There is much subtlety is this code. Got a better solution?

  PROCEDURE Search2 (VAR A   : ARRAY OF INTEGER;	      (* Efficiency *)
  		     Key     : INTEGER;
		     Default : INTEGER)           : INTEGER;
  VAR I,AnswerIndex : INTEGER;
  BEGIN 
    AnswerIndex:= Default;
    I:= 0;
    WHILE (I <= HIGH(A)) AND (AnswerIndex = Default) DO
      IF A[I] = Key
        THEN AnswerIndex:= I END;	 (* Found first I: A[I]  = Key      *)
      I:= I+1
    END;
    RETURN AnswerIndex
  END Search2;

  Here again I have abstracted some quotations from Sale's book (page 401).  I
have numbered the statements for easier reference later:

  "(1) Program fragments containing EXIT statements or unusually placed RETURN
   statements which are not at the termination points of their procedures
   (call these 'early returns') have a flow and execution which is not easy to
   understand....(2) Proof techniques or constructively correct design proce-
   dures are not easily applied to these constructs because they have complex
   semantic definitions....(3) Reading programs with EXIT statements and early
   returns can be difficult, especially if long procedure bodies are involved.
   Finding all the relevant embedded statements and in the case of EXITs
   determining which loop they belong to will require visual parsing of the
   text....(4) Other reasons for the constructs usually put forward are either
   convenience or efficiency. Both arguments are at odds with the primary aims
   of Modula-2, which are clarity, abstraction, and correctness....(5) Most
   programmers who are taught programming in accordance with modern computer
   science concepts have thought patterns which never evoke them."
(1) I believe that the flow of my CompareStrings and Search1 procedures are
easier to understand than the original CompareStrings or either of the Search2
procedures presented. Understandability isn't an inherent quality of a control
construct; it is a quality of how a control construct is used. I often find it
more natural to "return early"; avoiding such a control flow often reduces the
readability of my code. (Also see the whole GOTO controversy.)

(2)  Formalists may like to ban such programming constructs because they have
poor proof rules and can be abused. Yet the controlled use of these constructs
often allows programmers to write terse, correct code.  I prefer code that is
easily understandable to code that is easily provable (meaning code that uses
constructs with cleaner semantics).  To me, the ultimate test for using a
specific construct is can I, as a psychological entity, understand its meaning
better with or without using that construct. I always try to compare code that
uses a construct with code that omits it and uses alternatives.  It is still
possible to write terrible code while adhering to the 1-in/1-out constructs.

(3) In big subroutines, like the original StringCompare function, EXITs and
RETURNs may be visually lost (a clear commenting style would help); but in a
small subroutine (and it is my goal to make all my subroutines small), such
complex features can be used in simple ways to increase the clarity of the
code.  Here the author is advocating use based soley on syntactic advantages.

(4) To me it is often convenient to use such coding practices because they are
the natural, most clearly understood alternatives.  In such a case, I believe
that I am not at odds with Modula-2's "primary aims".  I also beleive that the
aim of efficiency is part of Modula-2 also.

(5) I believe that I am being accussed of sloppy thinking here.  The question
to me is whether one believes that formalism is the highest goal to aspire to;
whether one can follow some rules and be guaranteed to write beautiful code. I
prefer to let my thought patterns run wild. Rather than strive for formality,
I strive for simplicity (and find it in the darndest places).

  In summary I think that formalism, when correctly applied, is wonderful. But
formalism works best when applied to simple code. Before jumping into a formal
analysis of code, I like to try to ensure that the code is written as simply
and as easy to understand as possible. This precondition increases the chances
that further formal analysis will supply insight.


  On language design: By specifying two different ways in which strings can be
represented in Modula-2, Wirth has traded increased storage efficiency for
decreased algorithm clarity (and efficiency). A specification that each string
must be terminated by a 0C would make most string processing operations much
simpler (and therefore easier to write and understand), more efficient, and
less prone to errors.


  On Sale's book: I plan to use it in my introduction to programming classes
next year; apart from my negative comments here, this book is the best intro-
ductory text to Modula-2 that I have seen.


  The following pages contain the submissions that I received from the five
responders to my original message.  I have edited and annotated these sub-
missions.  I will post the complete text of all messages in another note.
Finally, when I analyzed these submissions, I had all five in front of me,
as well as my own solution.  The five individual authors did not have this
advantage.  If you have an comments on this note, please send them to me.
		 The Annotated Contributed Solutions


Tom Rodeheffer of DEC SERC correctly identified the error as follows:

  "Sale's algorithm fails when two equal strings are stored in arrays of
   different sizes, one array being exactly the length of the string, so
   that no padding is required, the other being longer, so that at least
   one padding character is present.  Sale's algorithm will say that the
   string stored in the shorter array compares before the other.  The
   problem occurs when the algorithm reaches the end of the shorter string
   without finding a difference or a string terminator.  The algorithm
   neglects to check if the longer string has a string terminator in the
   next character position."

He contributed the following beautiful piece of code (although I prefer LOOP
to WHILE TRUE - picky, picky).

  PROCEDURE CompareStrings (s1, s2 : ARRAY OF CHAR)  : Order;
  VAR
      i  : CARDINAL;
      c1 : CHAR;
      c2 : CHAR;
  BEGIN
      i := 0;
      WHILE TRUE DO
          IF i <= HIGH(s1) THEN c1 := s1[i] ELSE c1:= 0C END;
          IF i <= HIGH(s2) THEN c2 := s2[i] ELSE c2:= 0C END;
	  IF c1 < c2 THEN RETURN Before END;
	  IF c1 > c2 THEN RETURN After END;
	  IF c1 = 0C THEN RETURN Same  END;
	  i := i + 1;
       END;
END CompareStrings;

He then perspicuously describes why his algorithm works:

  "This algorithm views each array as if it were followed by an endless source
   of string terminators.  Thus, under this view, strings always end with a
   string terminator.  The terminator will come either from the array of from
   the view.  Given these conditions, it is easy to compare the strings."

Next, he discusses optimizations (where he leads the way to my solution).

  "The algorithm may be made more efficient by writing one section of code
   to handle character positions that occur in both arrays, and another
   section to handle the next character position, which does not occur in
   both arrays.  However, this makes the algorithm more complicated and
   harder to understand."

He goes on to say that if every string were guaranteed to be terminated by a
0C, the first two IF statments inside the loop could be simplified to straight
array assignments, without any loss of correctness.  His conlcusion:

  "It is interesting how easy it is to stumble over special cases, even for
   people who work with invariants and other program proof techniques."


My conclusion: I like Tom's algorithm more than mine.  It is simpler to read,
understand, and prove correct (in both an intuitive and formal sense).

Jonathan Brown (brown#jon%mfe@lll-mfe.arpa) also correctly identified the
error, which he described as follows:

  "Sale's function will return the wrong result when the following three
   conditions arise:
     1) the two strings are identical
     2) the lengths of the arrays containing the strings differ
     3) the string in the shorter array uses up the entire array"

He also realized that the assumption of multiple 0C padding was unnecessary,
so he submitted a revised solution for both parts 3 & 4 with the following
preface:

  "For my solution, I'm going to be lazy and just fix the bug in Sale's
   function. If there is a simpler, quicker routine, it is not obvious to me."

Here is the relevant part of his solution

               IF i > Shortest THEN
                       (*   Run out of one string's components. *)
                       IF HIGH(S1) = HIGH(S2) THEN (* Both together. *)
                            Result:= Same;
                       ELSIF (Shortest = HIGH(S1)) & (S2[i] # NUL) THEN
                            (* S1 shorter, but strings not equal *)
                            Result:= Before;
                       ELSIF S1[i] # NUL THEN
                            (* S2 shorter, but strings not equal *)
                            Result:= After;
                       ELSE
                            Result:= Same;
                       END;
                       ResultKnown:= TRUE;
               END;

Interestingly enough, this correction doesn't completely solve the problem.  I
assert that this function will yield an execution error under exactly the
conditions stated above (with S1 being the completely filled array).  If such
is the case, the first conjunct in the ELSIF is TRUE, but the second is FALSE;
therefore, the second ELSIF test is evaluated.  The access S1[i] is illegal
when Shortest = HIGH(S1) and i > Shortest.  It seems as if the solution should
be more symmetrical in its predicates.  The modification to this code that
yields a correct solution is shown at the end of this message. If you have not
yet derived the variation of the solution, try now before reading further.

Finally, he closes with the following statements:

  "There is a bug in Sale's function that results in an erroneous value
   returned in certain (probably rare) circumstances.  My solution looks
   bug-free to me, but I haven't tested it so I won't swear by it."

I find this quote intriguing.  No doubt Sale tested his function extensively.
The probability of finding the "incorrect cases" is very low.  But even after
Jonathan Brown identified these cases by pure deductive logic, he didn't test
them using his solution (actually, there are two cases S1 being filled and S2
being filled - only the first causes the execution error).  Given that Sale's
function was probably extensively tested, why does Jonathan end his submission
with the an appeal to testing?

Erik Jacobsen (mcvax!daimi!erja@seismo.CSS.GOV) didn't actually state inputs
that would cause the function to fail, but he identified Sale's misconception:

  "The problem is that HIGH(...) is the declared length of the ARRAY, not the
   actual length of the string. One way to patch it could be to start finding
   the actual length of each string, and use these lengths instead of
   HIGH(...)"

He also realized that the assumption of multiple 0C padding was unnecessary,
so he submitted a revised solution for both parts 3 & 4.  In addition, he
spotted a major simplification; note that CharPos is a functionalized version
of Tom Rodeheffer's "viewing solution".

  "Rewriting the function to be simpler, I first observe that we have two
   criteria to indicate the length of a string - either the ARRAY is filled
   or it is padded with NULs. I would prefer a simpler representation, e.g.
   that there always will be a NUL at the end of a string.  One way to
   accomplish this can be to look at a string through this function:

    PROCEDURE CharPos(S: ARRAY OF CHAR; Pos: CARDINAL): CHAR;
    BEGIN
      IF Pos > HIGH(S) THEN
        RETURN NUL
      ELSE
        RETURN S[Pos]
      END
    END;

He then finishes writing his solution (with a justification) by showing how
CharPos can be used within CompareStrings:

  "Rewriting CompareStrings as it stands is fairly simple. Use CharPos(Sx,i)
   instead of Sx[i] - or declare som CHAR-variables. As you will always end
   up with NUL, you can erase the code following IF i>shortest. CompareStrings
   still looks funny though - it's not easy to check the correctness. My way
   of doing it is:"

  PROCEDURE CompareStrings (s1, s2 : ARRAY OF CHAR)  : Order;
  VAR
      i  : CARDINAL;
      c1 : CHAR;
      c2 : CHAR;
  BEGIN    i := 0;
    REPEAT
      Ch1 := CharPos(S1,i); Ch2 := CharPos(S2,i);
      i := i + 1
    UNTIL (Ch1 <> Ch2) OR (Ch1 = NUL);

    IF    Ch1 < Ch2 THEN
      RETURN Before
    ELSIF Ch1 > Ch2 THEN
      RETURN After
    ELSE
      RETURN Same
    END;
  END CompareStrings;

Here Erik has retained the 1-in/1-out style of programming (except for the 3
RETURNs at the), placing the final disambiguation tests after the end of his
loop.  Note that this solution requires only two character comparisions per
loop (less than mine or Rodeheffer's).  He finally states that:

  "I haven't considered efficiency (and I usually don't care)."
Erik concludes with a set of annotated quotatations from Kernighan & Plauger,
"The Elements of Programming Style", which he says Sale has violated in his
function.  He follows these quotes with a rule of his own:

  "Write clearly - don't be to clever".
  "Write clearly - don't sacrifice for efficiency".
    Whatever the reasons were, the original function wasn't written clearly.
  "Choose a data representation that makes the program simple".
  "Terminate input by end-of-file or marker - not by count".
    Having two conditions for ending a string will make a program complicated.
  "Don't patch bad code - rewrite it".
    As was suggested in (3) and (4).

 A rule of my own: "A simple problem must have a simple solution," so keep re-
 writing your code until it doesn't look more complicated than the problem.

Juerg Wanner (mcvax!cernvax!owf!zjwa@seismo.CSS.GOV) starts his submission:

  "Not only will this (Sale's) procedure produce wrong results, but eventually
   the program using this procedure will crash due to a runtime error (index i
   can get larger than Min(HIGH(S1), HIGH(S2)) and so the procedure attempts
   to access a non-existing array element. A proper modula-2 implementation
   will crash there (unless run-time checks have been turned off)."

I must admit that I don't see how this can happen: yes, index i can exceed
Shortest, but an index with this value is never used to access either array.
(Someone please supply me with a simple example that exhibits this error.)

Juerg next supplies the following efficient CompareStrings function; he also
comments on his use of VAR parameters, but forgets the case of passing string
constants:

  "The parameters are declared as VAR parameters because of speed. There is
   no need to copy the strings in order to do a comparison. Of course this is
   bad programming style."

  PROCEDURE CompareStrings(VAR s1, s2 : ARRAY OF CHAR) : Comparison;
  VAR i, high : CARDINAL;
  BEGIN
    i := 0;
    high := Min(HIGH(s1), HIGH(s2));
    WHILE (i < high) & (s1[i] = s2[i]) & (s1[i] # 0C) & (s2[i] # 0C) DO
      INC(i);
    END;
    IF    s1[i] < s2[i] THEN RETURN less;
    ELSIF s1[i] > s2[i] THEN RETURN greater;
    ELSIF HIGH(s1) = HIGH(s2) THEN RETURN equal;
    ELSIF high = HIGH(s1) THEN
      IF s2[i+1] = 0C THEN RETURN equal;
      ELSE RETURN less;
      END;
    ELSE
      IF s1[i+1] = 0C THEN RETURN equal;
      ELSE RETURN greater;
      END;
    END;
  END CompareStrings;

Picky-point: the WHILE test suffix (s1[i] # 0C) & (s2[i] # 0C) is redundant;
to get here with SCBE, we know that s1[i] = s2[i], so we need only test one
of these elements for inequality with 0C.

This CompareStrings procedure again uses the 1-in/1-out style of programming
(except for the 7 RETURNs).  It is much in the same style as my solution, but
the omission of explicit RETURNs from the body of the loop causes an increase
is special-case checking after the loop.  The Rodeheffer/Jacobsen solutions
totally omit these "complex after the loop" checks via their viewing property.

After a second look at this function (the 7 RETURNs looked too asymmetic, a
sign that there may be something wrong), I found that indeed this function
contained a subtle error. Look at the case where the two strings s1 and s2 are
identical: both are short strings (ended by 0C) in very long arrays.  Finally
assume that the two actual arrays sizes are different.  The WHILE loop will
exit when both 0Cs are reached (with the index i addressing each 0C).  The
first IF and ELSIF tests are FALSE (s1[i]=s2[i]=0C), and so is the next ELSIF
(the arrays are of unequal lengths).  Assume that the array containing s1 is
the smaller one; so, high = HIGH(S1).  In this case we compare s2[i+1] against
0C; but this comparision is totally bogus, because s2[i+1] specifies the first
character BEYOND THE 0C in string s2.  Thus, the result of the returned value
is random.  The same is likewise true if high = HIGH(s2).

Bob Campbell (fluke!tikal!bobc@SUN.COM) also correctly identified the error,
which he described as follows:

  "The answer that I have is two strings which are the same but one fills up
   the full string and the second is shorter then a full array when this
   happens the string which fills up the whole array will be less then the one
   which does not."

His modification to Sale's original program is given by:

    IF i > Shortest THEN
	(* Run out of one string's components. *)
	IF HIGH(S1) = HIGH(S2) THEN (* Both together. *)
	    Result := Same
	ELSIF Shortest = HIGH(S1) THEN (* S1 shorter *)
	    IF (S2[i] = NUL) THEN
		Result := Same
	    ELSE
		Result := Before
	    END
	ELSE	(* S2 shorter *)
	    Result := After
	END;
	ResultKnown := TRUE
    END

Again, as in the previous "simple alteration", the code is still not correct.
It only goes halfway to the solution. If the "standard conditions" hold for S1
and S2 (with S1 being 0C ended and placed in the larger array), this function
returns After, when in fact the two strings are ths same.  Again, the symmetry
of the HIGHs is missing from the alteration.

Bob completely rewrites a final version for efficiently comparing Modula-2
strings as follows (here the symmetry is present):

    PROCEDURE CompareStrings(S1,S2:ARRAY OF CHAR) : Order;
    CONST
	NUL = 0C; (* Terminating Marker *)
    VAR
	i : CARDINAL;
	Shortest : CARDINAL;
    BEGIN
	Shortest := Min(HIGH(S1),HIGH(S2));
	i := 0;
	(* As long as the strings are the same check  and the end of
	 * neither of the strings have been reached increment to the
	 * next.
	 *)
	WHILE i <= Shortest AND S1[i] = S2[i] AND S1[i] # NUL DO
	    i := i + 1
	END;
	IF (i > Shortest) THEN
	    (* We are at the end of one of the strings *)
	    IF HIGH(S1) = HIGH(S2) THEN
		(* Strings are the Same length *)
		RETURN(Same)
	    ELSIF HIGH(S2) > Shortest THEN
		(* S2 is longer *)
		IF S2[i] = NUL THEN
		    (* Both strings terminate at the same place *)
		    RETURN(Same)
		ELSE
		    RETURN(Before)
		END
	    ELSE
		(* S1 is longer *)
		IF S1[i] = NUL THEN
		    (* Both strings terminate at the same place *)
		    RETURN(Same)
		ELSE
		    RETURN(After)
		END
	    END
	(* ASSERT at this else S1 can't equal S2 or both = NUL *)
	ELSIF S1[i] < S2[i] THEN
	    RETURN(After)
	ELSIF S1[i] > S2[i] THEN
	    RETURN(Before)
	ELSE
	    RETURN(Same)
	END
    END CompareStrings;

Picky-point: there are syntax errors in the WHILE test: AND has a precedence
higher than the relational operators.

This solution is similar to Juerg's, but it correctly (as far as I can tell)
handles all the special cases (requiring 8 RETURNs).  This aesthetic issue of
balance seems interesting here.  How much code should one place inside a loop
versus outside a loop. Some solutions place almost everything inside the loop;
others place almost everything after the loop; some place a bit inside the
loop and a bit after it.
-------

sbs@valid.UUCP (Steven Brian McKechnie Sargent) (07/10/86)

I've been idly interested in Modula-2 for a little while, especially
since I admire the author of the decwrl compiler (Michael Powell) as
a fellow of fine judgement in these matters.  However, given the
length of discussion required to solve the rather trivial problem of
string comparison, I'm very concerned!  Perhaps I will delay the
plunge into strongly typed languages a little longer, and save some
wear and tear on my keyboard.

I'll close with a re-implementation of the UNIX(TM) library classic strcmp,
which assumes NUL-termination for every string, and does not attempt to
keep around a separate "length" field.

/*
 * Return greater than, equal to, or less than 0 depending on whether
 * s is greater than, equal to, or less than t.
 */
strcmp(s, t)
	register char *s, *t;
{
	while (*s++ == *t)
		if (*t++ == 0)
			return 0;
	return *--s - *t;
}


S.

(TM) UNIX is a trademark of my cat.

ian@loral.UUCP (07/11/86)

In article <422@valid.UUCP> sbs@valid.UUCP (Steven Brian McKechnie Sargent) writes:
>I've been idly interested in Modula-2 for a little while, especially
>since I admire the author of the decwrl compiler (Michael Powell) as
>a fellow of fine judgement in these matters.  However, given the
>length of discussion required to solve the rather trivial problem of
>string comparison, I'm very concerned!  Perhaps I will delay the
>plunge into strongly typed languages a little longer, and save some
>wear and tear on my keyboard.
>

  Modula has its problems (e.g., no ARRAY of ARRAY of <type> and you
  can't implement printf), but this string compare issue is not a problem.

  The complexity of the string compare function in Modula has nothing
  to do with the fact that Modula is a strongly typed programming language.
  The problem has its roots in the way that most Modula systems implement
  strings.  A string is null terminated, unless it fills the entire
  character array, in which case there is no null.  As we have seen, this 
  means that the string compare code must perform checks to make sure that
  the string arrays are not indexed beyond their bounds.  There is no 
  reason that strings could not be implemented in the "right" way, always 
  null terminated.  As far as I know there is nothing explicit in Wirth's 
  books that disallows this.

  By the way, Michael Powell and his colleagues have fixed a number of 
  Modula's problems in their Modula implementation.

		     Ian Kaplan
		     Loral Dataflow Group
		     Loral Instrumentation
		     (619) 560-5888 x4812
	     USENET: {ucbvax,decvax,ihnp4}!sdcsvax!sdcc6!loral!ian
	     ARPA:   sdcc6!loral!ian@UCSD
	     USPS:   8401 Aero Dr. San Diego, CA 92123