PATTIS@WASHINGTON.ARPA (Richard Pattis) (07/09/86)
A Book Notice and Problem Statement A New Modula-2 Book: I recently received a copy of a new Modula-2 book. It was written by Arthur Sale of the Univeristy of Tasmania in Australia, and it is titled "Modula-2: Discipline & Design". It is published by Addison-Wesley and costs about $25. I'll quote a section from the preface to show what I think is the flavor of this book (all typos are my own): "My purpose in writing this book is to bring together a precise discussion of Modula-2 and its features and modern concepts of professional large program writing. There are few books which are accessible to learning students which use Modula-2 at the present time, and this is an inhibiting factor in its use as a teaching language. However, even more important in this book is the break with the traditional approaches to first language texts. It is my experience over ten years that learning students can cope with concepts of considerable abstraction, such as predicate transformers and invariants. Not all the implications are realized by students, but the concepts can be planted and used. The best students progress much faster, but even average students see the ideas which are presented as a result of modern computer science research as converting the dark art of programming into something which is understandable. The era of learning programming by osmosis, which has persisted for far too long, is ending, and programming has become a subject in which design and discipline reign and can be taught" A Modula-2 Problem: In Sale's discussion of strings, he decides to store each string padded with 0Cs (not just one, but 0Cs all the way up thru the last element in the array). Of course, if the string fills the array, there is no padding at all. Also, note that the string terminator, 0C, compares less than any other character. Given any two parameter strings that follow this convention, Sale writes the following function to compare them (I've tried to transcribe the code directly from this introductory text, using the same indentation and spacing): (1) I assert, that this function computes wrong answers. In what cases does it do so? Be as general as possible and remember the assumptions above. (2) Describe how we could rewrite this function to correct this deficiency. (3) Given the same assumptions, completely rewrite this function to be as simple as possible. It should also execute as quickly as possible. (4) Now, remove the padding assumption; instead, assume that only one 0C is placed at the end of each string that doesn't fill its array. Rewrite this function to work correctly with strings following this convention. (5) What interesting conclusions can you draw from this exercise? For parts 3 & 4, use your best programming style. Write these function as clearly as possible. Please send your answers to me, PATTIS@WASHINGTON. After you send me your analysis, I'll send you mine. I'll also keep a compendium of the analyses that I receive, and send it out to the mailing list when responses stop. If you don't want your analysis shown to anyone, or if you don't want your name attached to it, please tell me when you mail in your analysis. If there is enough interest in this exercise, I'll construct a similar one. TYPE Order = (Before,Same,After); PROCEDURE CompareStrings (S1: ARRAY OF CHAR; S2 : ARRAY OF CHAR) : Order; CONST NUL = 00C; (* Terminating Marker *) VAR i : CARDINAL; Shortest : CARDINAL; Result : Order; ResultKnown : BOOLEAN; BEGIN Shortest:= Min(HIGH(S1),HIGH(S2)); ResultKnown:= FALSE; i:= 0; WHILE NOT ResultKnown DO IF S1[i] < S2[i] THEN (* Found unmatched character *) Result:= Before; ResultKnown:= TRUE; ELSIF S1[i] > S2[i] THEN (* Found unmatched character *) Result:= After; ResultKnown:= TRUE; ELSE (* S1[i] = S2[i] *) IF S1[i] = NUL THEN (* Both strings same up to two NULs. *) Result:= Same; ResultKnown:= TRUE; ELSE (* Task unresolved, must look at next component. *) i:= i+1; IF i > Shortest THEN (* Run out of one string's components. *) IF HIGH(S1) = HIGH(S2) THEN (* Both together. *) Result:= Same; ELSIF Shortest = HIGH(S1) THEN (* S1 shorter *) Result:= Before; ELSE (* S2 shorter *) Result:= After; END; ResultKnown:= TRUE; END; END; END (* of IF comparing S[i]'s *); END (* of WHILE *); RETURN Result; END CompareStrings; Coding & Correctness: A Complex String Processing Example Please excuse the way that this note is written; I've been writing so long for underclassmen, I may have forgotten how to address the experts. I actually wrote this material for students in my second quarter programming course. (1) The CompareString function returns an incorrect answer when comparing S1 (a string filling its array) with S2 (a string equal to S1, but stored in a larger array; e.g., S2 is terminated by at least one 0C). The same is true when the properties of S1 and S2 are switched. For both these inputs, the CompareString function returns Before (After) instead of Same: the nested test HIGH(S1) = HIGH(S2) needs to determine whether the two strings are the same size - but instead, it only checks whether the two arrays are the same size. (2) CompareString could be corrected by adding more code to check for special cases. This added code would make a complex function even more complicated. Instead of fixing the "minor" problem here (and who knows what other "minor" problems would then crop up), I felt that the whole function could instead be rewritten much more simply. This was the gut feeling that motivated the entire problem. When teaching introductory programming, I quote to my students the following saying: "Bad programmers fix errors by adding corrective code. Good programmers fix errors by removing defective code." (3) When one writes a simpler version of CompareStrings, one finds that the idea of extra padding is unnecessary, Only in the original version of the CompareStrings function does padding simplify coding. Therefore, see the next paragraph for a solution to this problem too. (4) When I teach string processing in Modula-2, I first show my students the following High function (which is analogous to the built-in HIGH function). It takes O(N) to execute High - I assume that HIGH is O(1) - but it supplies its information about a string (as opposed to information about the array in which the string is stored). PROCEDURE High (S : ARRAY OF CHAR) : INTEGER; VAR I : INTEGER; BEGIN FOR I:=0 TO HIGH(S) DO IF S[I] = 0C THEN RETURN I-1 END END; RETURN HIGH(S) END High; Given this High function, and trichotomous versions of CompareChars and CompareIntegers, I wrote a much simpler version of CompareStrings. These three functions don't appear in my final solution, but I used them to help me write a simple, correct piece of code (which I then had enough confidence in to try to optimize). PROCEDURE CompareStrings (S1, S2 : ARRAY OF CHAR) : Order; VAR I : INTEGER; BEGIN FOR I:= 0 TO MIN(High(S1),High(S2)) DO CASE CompareChars(S1[I],S2[I]) OF Before : RETURN Before | After : RETURN After | Same : (* continue scanning *) END END; RETURN CompareIntegers(High(S1),High(S2)) END CompareStrings; My final solution to this problem follows a similar algorithm. It pushes an approximate calculation of each High function inside the FOR loop: if a 0C is reached in either S1 xor S2, then that string is smaller; if 0C is reached in both S1 and S2, the strings must be equal. The FOR loop exits (as opposed to returns) only if all elements thru HIGH of the smaller array are identical to all equivalent elements in the larger (or equal-sized) array. The final three statements handle the special case checking; they determe whether one string is truly longer than the other. Notice that at most one (and maybe neither) of the final IF statements may be TRUE, because MinH is >= at least one HIGH. PROCEDURE CompareStrings (S1, S2 : ARRAY OF CHAR) : Order; VAR I,MinH : INTEGER; BEGIN MinH:= MIN(HIGH(S1),HIGH(S2)); (* Or, expand into an IF *) FOR I:= 0 TO MinH DO IF S1[I] < S2[I] THEN RETURN Before END; IF S1[I] > S2[I] THEN RETURN After END; IF S1[I] = 0C THEN RETURN Same END (* If here, S1[I] = S2[I] *) END; IF (MinH < HIGH(S2) AND (S2[MinH+1] <> 0C) THEN RETURN Before END; IF (MinH < HIGH(S1) AND (S1[MinH+1] <> 0C) THEN RETURN After END; RETURN Same END CompareStrings; I'm assuming that a good optimizing compiler will cache the values of S1[I] and S2[I] inside the loop so that these arrays are actually accessed only once per loop iteration. If not, I can always declare two local variables C1 & C2 of type CHAR, and assign them the values of S1[I] and S2[I] respectively as the first two statements inside the loop. Notice that if all strings ended in at least one 0C, then the entire block of code following the FOR loop could be removed. In fact, the FOR loop could be turned into a LOOP incrementing I from 0 upward, avoiding the need for a MinH variable and any evaluation of HIGH. Finally, if we never needed to compare string literals, we could change both the parameter modes to VAR and avoid the copying (although this advantage may be offset by an increase in the time needed for each array access). Further transformations for efficiency gains may be very dependent on the compiler and machine architecture used. There is a slight oxymoron in question (3), where I ask for the simplest and most efficient code. Often these qualities oppose each other; but for the most part (ruling out exotic algorithms), the simplest code is (or can often be made, by something as simple as an optimizing compiler) the most efficient. (5) On teaching programming: First, I'd like to quote dircectly what Sale says about the implementation of StringCompare, after showing the code (page 266). "The procedure design was kept straightforward and provably correct on the grounds that this provides the best illustration for learners...The complex IF tests which determine the result when the procedure reaches the end of one or both strings is very rarely executed. (Of course, this makes its provable correctness even more important, because it may not be tested carefully!)" I would like to suggest that in programming, the search for simplicity should take precedence over the search for a "proof of correctness". Too many times I have written a complex algorithm and spent time trying to prove it correct (and sometimes succeeding, even when the algorithm was wrong), rather than spending my time looking for a simpler algorithm. This is the standard penny wise, pound foolish approach to reality. Unfortunately, when one searches for a complex proof of correctness, one often finds it too easily and in error. I would rather find a simple algorithm that is easily explained than a proof. For introductory programming classes, I try to emphasize the teaching of critical thinking. I attempt teach techniques for trying to prove that code is incorrect (such an activity may be labeled as aggressive debugging - even in the absence of manifest bugs). Mathematically proving and disproving code are identical, but in training future programmers, I would much prefer that my students say, "I haven't found anything wrong with my code yet," to the all too frequently heard, "It's correct." Of course, it's easier to spot errors in someone else's code, so one good way to teach this skill is to ask students to analyze code written by others (known to be wrong) and find any errors. They will enjoy this exercise more if the code has been published somewhere. As an exerci se, examine your favorite Pascal book for the code implementing some foundational algorithms in computer programming: prime testing, binary array searching, an O(N^2) sort, Quicksort, etc. Ask yourself first whether the code is correct; next ask yourself whether the code is presented as simply and clearly as possible; could you do better using Modula-2? Remember, these examples are used to introduce students to programming; they will be studied by tens of thousands of students and become standards for comparision. On formalism: I believe that formalism is useful only when it can be applied to extend what we can do, or accurately check what we have done. Formalism for its own sake is pedantry. For example, I advocate the use of simple control structures such as Pascal's, which subscribe to the 1-in/1-out rule. Yet in Modula-2 we may violate this rule by using LOOP with EXIT and subroutines that use RETURN. I use these features to simply implement simple algorithms: often such code is easier to understand, because it omits the introduction and use of extraneous state variables and other circuitous control flow. Here is an example function that appears in most introduction to programming books. It performs a linear search in array A for the index of Key; it returns the first index of Key in A (if such an index exists); otherwise it returns Default. I'll first show the Modula-2 solution that I teach. PROCEDURE Search1 (VAR A : ARRAY OF INTEGER; (* Efficiency *) Key : INTEGER; Default : INTEGER) : INTEGER; VAR I : INTEGER; BEGIN FOR I:=0 TO HIGH(A) DO IF A[I] = Key THEN RETURN I END (* Found first I: A[I] = Key *) END; RETURN Default (* 0<=I<=HIGH(A): A[I] <> Key *) END Search1; Here is the simplest solution that I know adhering to the 1-in/1-out rule. PROCEDURE Search2 (VAR A : ARRAY OF INTEGER; (* Efficiency *) Key : INTEGER; Default : INTEGER) : INTEGER; VAR I,AnswerIndex : INTEGER; BEGIN I:= 0; WHILE (I <= HIGH(A)) AND (A[I] <> Key) DO I:= I+1 END; IF I <= HIGH(A) (* Determine why WHILE loop exited *) THEN AnswerIndex:= I (* Found first I: A[I] = Key *) ELSE AnswerIndex:= Default (* 0<=I<=HIGH(A): A[I] <> Key *) END; RETURN AnswerIndex END Search2; Some comments on these solutions: Search1 uses two RETURN statements, one in the FOR loop. Many programmers consider the premature RETURN from the FOR loop to be bad programming practice. Search2 has one complex test instead of two simpler ones (actually, Search1 hides one test in the FOR's semantics). The two tests specify under what condition the loop should continue, the first specifies that the array end has not been reach, the second specifies that the Key has not been found; I find both these "negative" tests difficult to under- stand (is there a better solution using REPEAT?). In Search2, we must remember to include code to initialize, increment, and test the loop index. In Search2, we must disambiguate the loop exit condition in a seperate IF statement after exiting the loop; even in this IF we cannot perform the easier-to-understand "positive" test A[I] = Key (this inability is a direct result of short-circuit boolean expression evaluation, discussed below). Finally, in Search2 we need to introduce an extra state variable AnswerIndex (or worse, reuse the variable I). If this second example is upsetting, now remove the ability to perform short-circuit boolean evaluation (SCBE) and rewrite this function in Pascal. Some programmers may argue that SCBE should be disallowed for similar reasons of formalilty, because it doesn't support the intuitive semantics for boolean expressions (commutivity); in the example above, we must very carefully select our post-loop IF test. OK, I lied: here is a simpler Search2 example. Other people use a variant that replaces AnswerIndex with a boolean variable indicating whether the key element has been found (such a solution must set I to the correct return value either inside the loop or after it exits). This style of coding also works in Pascal. I still think that Search2 is more complicated and more difficult to understand than Search1: for example, I think that it is un-intuitive to set AnswerIndex to Default before the loop code executes. Also, this code fails if the first occurrence of Key is in A[Default], and it is followed by another occurence of Key. There is much subtlety is this code. Got a better solution? PROCEDURE Search2 (VAR A : ARRAY OF INTEGER; (* Efficiency *) Key : INTEGER; Default : INTEGER) : INTEGER; VAR I,AnswerIndex : INTEGER; BEGIN AnswerIndex:= Default; I:= 0; WHILE (I <= HIGH(A)) AND (AnswerIndex = Default) DO IF A[I] = Key THEN AnswerIndex:= I END; (* Found first I: A[I] = Key *) I:= I+1 END; RETURN AnswerIndex END Search2; Here again I have abstracted some quotations from Sale's book (page 401). I have numbered the statements for easier reference later: "(1) Program fragments containing EXIT statements or unusually placed RETURN statements which are not at the termination points of their procedures (call these 'early returns') have a flow and execution which is not easy to understand....(2) Proof techniques or constructively correct design proce- dures are not easily applied to these constructs because they have complex semantic definitions....(3) Reading programs with EXIT statements and early returns can be difficult, especially if long procedure bodies are involved. Finding all the relevant embedded statements and in the case of EXITs determining which loop they belong to will require visual parsing of the text....(4) Other reasons for the constructs usually put forward are either convenience or efficiency. Both arguments are at odds with the primary aims of Modula-2, which are clarity, abstraction, and correctness....(5) Most programmers who are taught programming in accordance with modern computer science concepts have thought patterns which never evoke them." (1) I believe that the flow of my CompareStrings and Search1 procedures are easier to understand than the original CompareStrings or either of the Search2 procedures presented. Understandability isn't an inherent quality of a control construct; it is a quality of how a control construct is used. I often find it more natural to "return early"; avoiding such a control flow often reduces the readability of my code. (Also see the whole GOTO controversy.) (2) Formalists may like to ban such programming constructs because they have poor proof rules and can be abused. Yet the controlled use of these constructs often allows programmers to write terse, correct code. I prefer code that is easily understandable to code that is easily provable (meaning code that uses constructs with cleaner semantics). To me, the ultimate test for using a specific construct is can I, as a psychological entity, understand its meaning better with or without using that construct. I always try to compare code that uses a construct with code that omits it and uses alternatives. It is still possible to write terrible code while adhering to the 1-in/1-out constructs. (3) In big subroutines, like the original StringCompare function, EXITs and RETURNs may be visually lost (a clear commenting style would help); but in a small subroutine (and it is my goal to make all my subroutines small), such complex features can be used in simple ways to increase the clarity of the code. Here the author is advocating use based soley on syntactic advantages. (4) To me it is often convenient to use such coding practices because they are the natural, most clearly understood alternatives. In such a case, I believe that I am not at odds with Modula-2's "primary aims". I also beleive that the aim of efficiency is part of Modula-2 also. (5) I believe that I am being accussed of sloppy thinking here. The question to me is whether one believes that formalism is the highest goal to aspire to; whether one can follow some rules and be guaranteed to write beautiful code. I prefer to let my thought patterns run wild. Rather than strive for formality, I strive for simplicity (and find it in the darndest places). In summary I think that formalism, when correctly applied, is wonderful. But formalism works best when applied to simple code. Before jumping into a formal analysis of code, I like to try to ensure that the code is written as simply and as easy to understand as possible. This precondition increases the chances that further formal analysis will supply insight. On language design: By specifying two different ways in which strings can be represented in Modula-2, Wirth has traded increased storage efficiency for decreased algorithm clarity (and efficiency). A specification that each string must be terminated by a 0C would make most string processing operations much simpler (and therefore easier to write and understand), more efficient, and less prone to errors. On Sale's book: I plan to use it in my introduction to programming classes next year; apart from my negative comments here, this book is the best intro- ductory text to Modula-2 that I have seen. The following pages contain the submissions that I received from the five responders to my original message. I have edited and annotated these sub- missions. I will post the complete text of all messages in another note. Finally, when I analyzed these submissions, I had all five in front of me, as well as my own solution. The five individual authors did not have this advantage. If you have an comments on this note, please send them to me. The Annotated Contributed Solutions Tom Rodeheffer of DEC SERC correctly identified the error as follows: "Sale's algorithm fails when two equal strings are stored in arrays of different sizes, one array being exactly the length of the string, so that no padding is required, the other being longer, so that at least one padding character is present. Sale's algorithm will say that the string stored in the shorter array compares before the other. The problem occurs when the algorithm reaches the end of the shorter string without finding a difference or a string terminator. The algorithm neglects to check if the longer string has a string terminator in the next character position." He contributed the following beautiful piece of code (although I prefer LOOP to WHILE TRUE - picky, picky). PROCEDURE CompareStrings (s1, s2 : ARRAY OF CHAR) : Order; VAR i : CARDINAL; c1 : CHAR; c2 : CHAR; BEGIN i := 0; WHILE TRUE DO IF i <= HIGH(s1) THEN c1 := s1[i] ELSE c1:= 0C END; IF i <= HIGH(s2) THEN c2 := s2[i] ELSE c2:= 0C END; IF c1 < c2 THEN RETURN Before END; IF c1 > c2 THEN RETURN After END; IF c1 = 0C THEN RETURN Same END; i := i + 1; END; END CompareStrings; He then perspicuously describes why his algorithm works: "This algorithm views each array as if it were followed by an endless source of string terminators. Thus, under this view, strings always end with a string terminator. The terminator will come either from the array of from the view. Given these conditions, it is easy to compare the strings." Next, he discusses optimizations (where he leads the way to my solution). "The algorithm may be made more efficient by writing one section of code to handle character positions that occur in both arrays, and another section to handle the next character position, which does not occur in both arrays. However, this makes the algorithm more complicated and harder to understand." He goes on to say that if every string were guaranteed to be terminated by a 0C, the first two IF statments inside the loop could be simplified to straight array assignments, without any loss of correctness. His conlcusion: "It is interesting how easy it is to stumble over special cases, even for people who work with invariants and other program proof techniques." My conclusion: I like Tom's algorithm more than mine. It is simpler to read, understand, and prove correct (in both an intuitive and formal sense). Jonathan Brown (brown#jon%mfe@lll-mfe.arpa) also correctly identified the error, which he described as follows: "Sale's function will return the wrong result when the following three conditions arise: 1) the two strings are identical 2) the lengths of the arrays containing the strings differ 3) the string in the shorter array uses up the entire array" He also realized that the assumption of multiple 0C padding was unnecessary, so he submitted a revised solution for both parts 3 & 4 with the following preface: "For my solution, I'm going to be lazy and just fix the bug in Sale's function. If there is a simpler, quicker routine, it is not obvious to me." Here is the relevant part of his solution IF i > Shortest THEN (* Run out of one string's components. *) IF HIGH(S1) = HIGH(S2) THEN (* Both together. *) Result:= Same; ELSIF (Shortest = HIGH(S1)) & (S2[i] # NUL) THEN (* S1 shorter, but strings not equal *) Result:= Before; ELSIF S1[i] # NUL THEN (* S2 shorter, but strings not equal *) Result:= After; ELSE Result:= Same; END; ResultKnown:= TRUE; END; Interestingly enough, this correction doesn't completely solve the problem. I assert that this function will yield an execution error under exactly the conditions stated above (with S1 being the completely filled array). If such is the case, the first conjunct in the ELSIF is TRUE, but the second is FALSE; therefore, the second ELSIF test is evaluated. The access S1[i] is illegal when Shortest = HIGH(S1) and i > Shortest. It seems as if the solution should be more symmetrical in its predicates. The modification to this code that yields a correct solution is shown at the end of this message. If you have not yet derived the variation of the solution, try now before reading further. Finally, he closes with the following statements: "There is a bug in Sale's function that results in an erroneous value returned in certain (probably rare) circumstances. My solution looks bug-free to me, but I haven't tested it so I won't swear by it." I find this quote intriguing. No doubt Sale tested his function extensively. The probability of finding the "incorrect cases" is very low. But even after Jonathan Brown identified these cases by pure deductive logic, he didn't test them using his solution (actually, there are two cases S1 being filled and S2 being filled - only the first causes the execution error). Given that Sale's function was probably extensively tested, why does Jonathan end his submission with the an appeal to testing? Erik Jacobsen (mcvax!daimi!erja@seismo.CSS.GOV) didn't actually state inputs that would cause the function to fail, but he identified Sale's misconception: "The problem is that HIGH(...) is the declared length of the ARRAY, not the actual length of the string. One way to patch it could be to start finding the actual length of each string, and use these lengths instead of HIGH(...)" He also realized that the assumption of multiple 0C padding was unnecessary, so he submitted a revised solution for both parts 3 & 4. In addition, he spotted a major simplification; note that CharPos is a functionalized version of Tom Rodeheffer's "viewing solution". "Rewriting the function to be simpler, I first observe that we have two criteria to indicate the length of a string - either the ARRAY is filled or it is padded with NULs. I would prefer a simpler representation, e.g. that there always will be a NUL at the end of a string. One way to accomplish this can be to look at a string through this function: PROCEDURE CharPos(S: ARRAY OF CHAR; Pos: CARDINAL): CHAR; BEGIN IF Pos > HIGH(S) THEN RETURN NUL ELSE RETURN S[Pos] END END; He then finishes writing his solution (with a justification) by showing how CharPos can be used within CompareStrings: "Rewriting CompareStrings as it stands is fairly simple. Use CharPos(Sx,i) instead of Sx[i] - or declare som CHAR-variables. As you will always end up with NUL, you can erase the code following IF i>shortest. CompareStrings still looks funny though - it's not easy to check the correctness. My way of doing it is:" PROCEDURE CompareStrings (s1, s2 : ARRAY OF CHAR) : Order; VAR i : CARDINAL; c1 : CHAR; c2 : CHAR; BEGIN i := 0; REPEAT Ch1 := CharPos(S1,i); Ch2 := CharPos(S2,i); i := i + 1 UNTIL (Ch1 <> Ch2) OR (Ch1 = NUL); IF Ch1 < Ch2 THEN RETURN Before ELSIF Ch1 > Ch2 THEN RETURN After ELSE RETURN Same END; END CompareStrings; Here Erik has retained the 1-in/1-out style of programming (except for the 3 RETURNs at the), placing the final disambiguation tests after the end of his loop. Note that this solution requires only two character comparisions per loop (less than mine or Rodeheffer's). He finally states that: "I haven't considered efficiency (and I usually don't care)." Erik concludes with a set of annotated quotatations from Kernighan & Plauger, "The Elements of Programming Style", which he says Sale has violated in his function. He follows these quotes with a rule of his own: "Write clearly - don't be to clever". "Write clearly - don't sacrifice for efficiency". Whatever the reasons were, the original function wasn't written clearly. "Choose a data representation that makes the program simple". "Terminate input by end-of-file or marker - not by count". Having two conditions for ending a string will make a program complicated. "Don't patch bad code - rewrite it". As was suggested in (3) and (4). A rule of my own: "A simple problem must have a simple solution," so keep re- writing your code until it doesn't look more complicated than the problem. Juerg Wanner (mcvax!cernvax!owf!zjwa@seismo.CSS.GOV) starts his submission: "Not only will this (Sale's) procedure produce wrong results, but eventually the program using this procedure will crash due to a runtime error (index i can get larger than Min(HIGH(S1), HIGH(S2)) and so the procedure attempts to access a non-existing array element. A proper modula-2 implementation will crash there (unless run-time checks have been turned off)." I must admit that I don't see how this can happen: yes, index i can exceed Shortest, but an index with this value is never used to access either array. (Someone please supply me with a simple example that exhibits this error.) Juerg next supplies the following efficient CompareStrings function; he also comments on his use of VAR parameters, but forgets the case of passing string constants: "The parameters are declared as VAR parameters because of speed. There is no need to copy the strings in order to do a comparison. Of course this is bad programming style." PROCEDURE CompareStrings(VAR s1, s2 : ARRAY OF CHAR) : Comparison; VAR i, high : CARDINAL; BEGIN i := 0; high := Min(HIGH(s1), HIGH(s2)); WHILE (i < high) & (s1[i] = s2[i]) & (s1[i] # 0C) & (s2[i] # 0C) DO INC(i); END; IF s1[i] < s2[i] THEN RETURN less; ELSIF s1[i] > s2[i] THEN RETURN greater; ELSIF HIGH(s1) = HIGH(s2) THEN RETURN equal; ELSIF high = HIGH(s1) THEN IF s2[i+1] = 0C THEN RETURN equal; ELSE RETURN less; END; ELSE IF s1[i+1] = 0C THEN RETURN equal; ELSE RETURN greater; END; END; END CompareStrings; Picky-point: the WHILE test suffix (s1[i] # 0C) & (s2[i] # 0C) is redundant; to get here with SCBE, we know that s1[i] = s2[i], so we need only test one of these elements for inequality with 0C. This CompareStrings procedure again uses the 1-in/1-out style of programming (except for the 7 RETURNs). It is much in the same style as my solution, but the omission of explicit RETURNs from the body of the loop causes an increase is special-case checking after the loop. The Rodeheffer/Jacobsen solutions totally omit these "complex after the loop" checks via their viewing property. After a second look at this function (the 7 RETURNs looked too asymmetic, a sign that there may be something wrong), I found that indeed this function contained a subtle error. Look at the case where the two strings s1 and s2 are identical: both are short strings (ended by 0C) in very long arrays. Finally assume that the two actual arrays sizes are different. The WHILE loop will exit when both 0Cs are reached (with the index i addressing each 0C). The first IF and ELSIF tests are FALSE (s1[i]=s2[i]=0C), and so is the next ELSIF (the arrays are of unequal lengths). Assume that the array containing s1 is the smaller one; so, high = HIGH(S1). In this case we compare s2[i+1] against 0C; but this comparision is totally bogus, because s2[i+1] specifies the first character BEYOND THE 0C in string s2. Thus, the result of the returned value is random. The same is likewise true if high = HIGH(s2). Bob Campbell (fluke!tikal!bobc@SUN.COM) also correctly identified the error, which he described as follows: "The answer that I have is two strings which are the same but one fills up the full string and the second is shorter then a full array when this happens the string which fills up the whole array will be less then the one which does not." His modification to Sale's original program is given by: IF i > Shortest THEN (* Run out of one string's components. *) IF HIGH(S1) = HIGH(S2) THEN (* Both together. *) Result := Same ELSIF Shortest = HIGH(S1) THEN (* S1 shorter *) IF (S2[i] = NUL) THEN Result := Same ELSE Result := Before END ELSE (* S2 shorter *) Result := After END; ResultKnown := TRUE END Again, as in the previous "simple alteration", the code is still not correct. It only goes halfway to the solution. If the "standard conditions" hold for S1 and S2 (with S1 being 0C ended and placed in the larger array), this function returns After, when in fact the two strings are ths same. Again, the symmetry of the HIGHs is missing from the alteration. Bob completely rewrites a final version for efficiently comparing Modula-2 strings as follows (here the symmetry is present): PROCEDURE CompareStrings(S1,S2:ARRAY OF CHAR) : Order; CONST NUL = 0C; (* Terminating Marker *) VAR i : CARDINAL; Shortest : CARDINAL; BEGIN Shortest := Min(HIGH(S1),HIGH(S2)); i := 0; (* As long as the strings are the same check and the end of * neither of the strings have been reached increment to the * next. *) WHILE i <= Shortest AND S1[i] = S2[i] AND S1[i] # NUL DO i := i + 1 END; IF (i > Shortest) THEN (* We are at the end of one of the strings *) IF HIGH(S1) = HIGH(S2) THEN (* Strings are the Same length *) RETURN(Same) ELSIF HIGH(S2) > Shortest THEN (* S2 is longer *) IF S2[i] = NUL THEN (* Both strings terminate at the same place *) RETURN(Same) ELSE RETURN(Before) END ELSE (* S1 is longer *) IF S1[i] = NUL THEN (* Both strings terminate at the same place *) RETURN(Same) ELSE RETURN(After) END END (* ASSERT at this else S1 can't equal S2 or both = NUL *) ELSIF S1[i] < S2[i] THEN RETURN(After) ELSIF S1[i] > S2[i] THEN RETURN(Before) ELSE RETURN(Same) END END CompareStrings; Picky-point: there are syntax errors in the WHILE test: AND has a precedence higher than the relational operators. This solution is similar to Juerg's, but it correctly (as far as I can tell) handles all the special cases (requiring 8 RETURNs). This aesthetic issue of balance seems interesting here. How much code should one place inside a loop versus outside a loop. Some solutions place almost everything inside the loop; others place almost everything after the loop; some place a bit inside the loop and a bit after it. -------
sbs@valid.UUCP (Steven Brian McKechnie Sargent) (07/10/86)
I've been idly interested in Modula-2 for a little while, especially
since I admire the author of the decwrl compiler (Michael Powell) as
a fellow of fine judgement in these matters. However, given the
length of discussion required to solve the rather trivial problem of
string comparison, I'm very concerned! Perhaps I will delay the
plunge into strongly typed languages a little longer, and save some
wear and tear on my keyboard.
I'll close with a re-implementation of the UNIX(TM) library classic strcmp,
which assumes NUL-termination for every string, and does not attempt to
keep around a separate "length" field.
/*
* Return greater than, equal to, or less than 0 depending on whether
* s is greater than, equal to, or less than t.
*/
strcmp(s, t)
register char *s, *t;
{
while (*s++ == *t)
if (*t++ == 0)
return 0;
return *--s - *t;
}
S.
(TM) UNIX is a trademark of my cat.
ian@loral.UUCP (07/11/86)
In article <422@valid.UUCP> sbs@valid.UUCP (Steven Brian McKechnie Sargent) writes: >I've been idly interested in Modula-2 for a little while, especially >since I admire the author of the decwrl compiler (Michael Powell) as >a fellow of fine judgement in these matters. However, given the >length of discussion required to solve the rather trivial problem of >string comparison, I'm very concerned! Perhaps I will delay the >plunge into strongly typed languages a little longer, and save some >wear and tear on my keyboard. > Modula has its problems (e.g., no ARRAY of ARRAY of <type> and you can't implement printf), but this string compare issue is not a problem. The complexity of the string compare function in Modula has nothing to do with the fact that Modula is a strongly typed programming language. The problem has its roots in the way that most Modula systems implement strings. A string is null terminated, unless it fills the entire character array, in which case there is no null. As we have seen, this means that the string compare code must perform checks to make sure that the string arrays are not indexed beyond their bounds. There is no reason that strings could not be implemented in the "right" way, always null terminated. As far as I know there is nothing explicit in Wirth's books that disallows this. By the way, Michael Powell and his colleagues have fixed a number of Modula's problems in their Modula implementation. Ian Kaplan Loral Dataflow Group Loral Instrumentation (619) 560-5888 x4812 USENET: {ucbvax,decvax,ihnp4}!sdcsvax!sdcc6!loral!ian ARPA: sdcc6!loral!ian@UCSD USPS: 8401 Aero Dr. San Diego, CA 92123