[comp.lang.misc] The Joy of Zero-based Arrays

skef@SPICE.CS.CMU.EDU (Skef Wholey) (03/03/88)

In a couple of recent articles, people have asserted that zero-based arrays
are somehow more burdensome and less natural to use than one-based arrays:

>From: sommar@enea.se (Erland Sommarskog)
>[...]
>  The idea of having all arrays starting on index zero is really to
>put the burden on the programmer. If Wirth really must have a fixed
>lower index, he could at least have chosen one! How many of you do
>often use arrays with zero as the lower bound? I do it very seldom.
>And his argument about saving execution time for index computation
>is truely naive. The result will just be references like:
>    String(char_no + 1), 
>just making the code more verbose and complex for no use.
>[...]

>From: pattis@june.cs.washington.edu (Richard Pattis)
>
>I too detest 0-based arrays (and think 1-based arrays are better).  In trying
>to rationalize this choice, I came up the the following argument: When we
>process arrays, we often need to know the size of the array and/or the
>position of the last element.  In 0-based arrays, these quantites are off by
>one; in 1-based arrays they are the same.  Hence, I think that people will
>make fewer errors with 1-based arrays.

Here I argue the opposite: that zero-based arrays are God's Gift To
Programmers, and that one-based arrays are Satan's Way Of Complicating Code.
I find that add-or-subtract-one code is more likely to arise in situations in
which one-based arrays are used.  I find that zero-based arrays lead to
simpler code.

Consider the task of processing subsequences of a one-dimensional array.
There are two obvious ways to describe a subsequence: with a start and an end,
or with a start and a length.  Consider now two problems: iterating over a
subsequence (while correctly dealing with zero length subseqences), and moving
from one subsequence to the next.

The notion of subsequence in Common Lisp (which, like C, provides only
zero-based arrays) is that the start index specifies the first element of a
subsequence and the end index specifies the index *one-beyond* the last
element of the subsequence.  Given the length and start of a subsequence, one
computes the end simply by adding (i.e., End = Start + Length).  Likewise, the
length of a subsequence can be computing by subtracting the start from the
end (i.e., Length = End - Start).  A zero-length sequence has Start = End.

   0 1 2 3 4 5 6 7 8
  +-+-+-+-+-+-+-+-+-+
  |X|X|X| | | | | | |
  +-+-+-+-+-+-+-+-+-+
   ^     ^
   |     |
 start  end

The subsequence of three X's has a start of 0 and an end of 3.  Its length is
3 - 0 = 3.  The length of this array is 9 -- it has nine elements; as a
subseqeunce it can be described as having start = 0 and end = 9 (or length =
9).  The Common Lisp Make-Array function uses this number, and the
Array-Length function will return this number.  Likewise the declared size in
C would be this number, and sizeof will return something proportional to it.
Within this beautifully consistent end-exclusive model, *none* of these
quantities are "off by one."  To frob the elements of a subsequence of this
sort, we do something like:

  (do ((i start (1+ i)))   ; Do for I beginning at Start, adding one each time
      ((= i end))          ; Until I = End
    (frob (aref array i))) ; Frobbing the I'th element of the Array

or

  (do ((i1 start (1+ i))
       (i2 0 (1+ i)))      ; I2 tells us where we are relative to the start
      ((= i2 length))
    (frob (aref array i1) (aref another-array i2)))

The first loop says, in non-Lispy terms:

  For I from Start (inclusive) to End (exclusive) Do
    Frob(Array[I]);

The end test must always come before the body of the loop so that zero-length
subsequences are processed correctly.

The subsequence following the one defined by the interval [start,end) is
[end,new-end).

  (do ((i old-end (1+ i)))
      ((= i new-end))
    ...)

The *only* +1's here are in the iteration steps; there are *none* in the
initialization or termination clauses.  The start-inclusive end-exclusive
property is captured in the simple iteration construct Dotimes.  For example,

   (dotimes (i 4)
     ...)

executes its body four times, for I=0, I=1, I=2, and I=3.

Now, one can of course use the end-exclusive model with one-based arrays, but
then the subsequence containing the whole array does *not* have an end that is
the highest index of the array; it is one more than that index.  So when using
one-based arrays people usually use an end-INclusive model, which also leads
to nastiness. The Length of a subsequence is now End - Start + 1.  To
represent a zero-length sequence the end index must be one LESS than the start
index.

  (do ((i start (1+ i)))
      ((= i (1+ end)))
    (frob (oref array i))) ; "Oref" for "One-based Aref"

or

  (do ((i1 start (1+ i))
       (i2 1 (1+ i)))      ; I2 tells us where we are relative to the start
      ((= i2 (1+ length)))
    (frob (oref array i1) (oref another-array i2)))

What was a clean, crisp, equality test with the end index in the first loop
above is now a sloppy-looking test against the end plus one.  What was a
comparison with the length is now a comparison with the length plus one.

And a following subsequence starts at one MORE then the end of the current
subsequence.

  (do ((i (1+ old-end) (1+ i))
      ((< i new-end))
    ...)

So the +1's suddenly start creeping into the code, into the initialization or
termination clauses.

I've written tens of thousands of lines of code in Common Lisp and C, and
after I understood the beauty and just plain right-ness of the end-exclusive
model, zero-based arrays became a joy to use.  I seldom, if ever, need to
adjust starts, ends, lengths, or indices by one.  But I had to do that kind of
stuff frequently when using one-based arrays.  The more random incrementing
and decrementing you do, the more fence-post errors your code will have.

Remember: Zero is your friend.

--Skef
  (Preferred mail address: Wholey@C.CS.CMU.EDU)

barmar@think.COM (Barry Margolin) (03/03/88)

In article <1012@PT.CS.CMU.EDU> skef@SPICE.CS.CMU.EDU (Skef Wholey) writes:
>In a couple of recent articles, people have asserted that zero-based arrays
>are somehow more burdensome and less natural to use than one-based arrays:
[references deleted]
>Here I argue the opposite: that zero-based arrays are God's Gift To
>Programmers, and that one-based arrays are Satan's Way Of Complicating Code.

I've used both PL/I (which allows arbitrary array indices, but
defaults to 1-based, and only has 1-based strings, and Lisp, which
only has 0-based arrays.  It hardly needs to be said that the best
thing is to allow the programmer to specify the index base.  I've had
my share of fencepost errors in both languages, but I think I prefer
1-based arrays.

>Now, one can of course use the end-exclusive model with one-based arrays, but
>then the subsequence containing the whole array does *not* have an end that is
>the highest index of the array; it is one more than that index.  

That's true for 0-based arrays, too.  For a 10-element 0-based array,
the end-exclusive subsequence is [0,10), but the highest index of the
array is 9.  If it is 1-based, the end-exclusive subsequence is
[1,11), and the highest index is 10.  What's the difference?

>								  So when using
>one-based arrays people usually use an end-INclusive model, which also leads
>to nastiness. The Length of a subsequence is now End - Start + 1.  To
>represent a zero-length sequence the end index must be one LESS than the start
>index.

Yes, I admit that arrays declared (1 .. 0) seem strange, and I think
some languages don't even allow it.

>
>  (do ((i start (1+ i)))
>      ((= i (1+ end)))
>    (frob (oref array i))) ; "Oref" for "One-based Aref"

actually, we generally used (> i end) as our end test.  Or we use
traditional FROM-TO loops, as in the PL/I

	do i = start to end;
	     frob (array(i));
	end;

In your posting you made mention of something like this when you
trandlated your Lisp DO to pseudo-Pascal:

>  For I from Start (inclusive) to End (exclusive) Do
>    Frob(Array[I]);

Most languages don't have an FOR-loop that is exclusive of the End
(Zetalisp LOOP is an exception, with its FOR <index> FROM <start>
BELOW <end> option); the End is usually inclusive, which works
perfectly when you use the end-inclusive model of array subranges.
For example, to loop over an entire array in PL/I you simply do:

	do i = lbound(array,1) to hbound(array,1);
	     frob (array(i));
	end;

This code is actually independent of the base of the array, because it
uses the LBOUND (Lower BOUND) and HBOUND (Higher BOUND) builtin
functions to determine the actual index range (the second argument is
the dimension whose bound should be returned).

>And a following subsequence starts at one MORE then the end of the current
>subsequence.
>
>  (do ((i (1+ old-end) (1+ i))
>      ((< i new-end))
>    ...)
>
>So the +1's suddenly start creeping into the code, into the initialization or
>termination clauses.

Well, in PL/I, at the end of a "do i = start to end" loop, the
variable i is left with the value end+1.  So, it is common to continue
with:

	do i = i to new-end;...

>I've written tens of thousands of lines of code in Common Lisp and C, and
>after I understood the beauty and just plain right-ness of the end-exclusive
>model, zero-based arrays became a joy to use.  I seldom, if ever, need to
>adjust starts, ends, lengths, or indices by one.  But I had to do that kind of
>stuff frequently when using one-based arrays.  The more random incrementing
>and decrementing you do, the more fence-post errors your code will have.

About the only time in PL/I where you have to add 1 is when going from
loops like the above to calls to the SUBSTR builtin, which extracts a
portion of a string.  This function takes start and length rather than
start and end.

As I pointed out above, there's no real reason why end-inclusive
subranges and 1-based arrays must be related.

By the way, Lisp isn't even consistent about whether it is 1-based or
0-based.  (FIRST <list>) is equivalent to (NTH 0 <list>), (SECOND
<list>) is (NTH 1 <list>), etc.  Mind you, I'm not complaining; the
ordinal functions are intended for those cases where the programmer
knows the position a priori, and we humans tend to think 1-based.

Barry Margolin
Thinking Machines Corp.

barmar@think.com
uunet!think!barmar

eugene@eos.UUCP (Eugene Miya) (03/04/88)

Given the choice of a 0 or 1 start, I now prefer 0, there are a few
applications where it is handy and you just make the array +1 larger,
or ignore element 0.  but this is a cludge as well.  Actually, I think
that Pascal and Ada handle this problem even better with subranges
and enumerations, but this is my quirk, Pascal's problem is those
stupid pred() and succ() operators.  Yes, range checking can be a
pain (read "reduced performance").  Another problem is how to handle
subarrays.  Implementation is row or column major (See comp.lang.fortran)
and this is an unfortunate detail.  There are other array tricks coming
down the pipe (read Fortran 8X) but when it boils down to either
1 based or 0 based are both arbitrary (ridiculous since either side can
find counter arguments).

From the Rock of Ages Home for Retired Hackers:

--eugene miya, NASA Ames Research Center, eugene@ames-aurora.ARPA
  "You trust the `reply' command with all those different mailers out there?"
  "Send mail, avoid follow-ups.  If enough, I'll summarize."
  {uunet,hplabs,hao,ihnp4,decwrl,allegra,tektronix}!ames!aurora!eugene

skef@SPICE.CS.CMU.EDU (Skef Wholey) (03/04/88)

>From: barmar@think.COM (Barry Margolin)
>[...]
>I've used both PL/I (which allows arbitrary array indices, but
>defaults to 1-based, and only has 1-based strings, and Lisp, which
>only has 0-based arrays.  It hardly needs to be said that the best
>thing is to allow the programmer to specify the index base.  I've had
>my share of fencepost errors in both languages, but I think I prefer
>1-based arrays.

I can't contest your preference.  The articles I was responding to,
though, asserted that zero-based arrays had the "obvious" disadvantage
that things were "off by one a lot."  I hope I've shown that this isn't
really the case.

You (Barmar) bring up the good point that if you've got the right set of
iteration constructs, one-based arrays can be dealt with as cleanly as
zero-based arrays.  I agree with this, and almost said as much in my
previous post.  In Lisp the user can define these things as macros if
they aren't provided (as by Zetalisp's LOOP).  PL/I has these things,
but then, what doesn't it have?

This zero-based/one-based discussion arose from the description of the
Oberon programming language.  Oberon, like Pascal and Modula-2 and even
C, provides a few iteration constructs and no practical, general, way of
defining more (the C preprocessor leaves much to be desired as a
macro-writing language).  So, given these restrictions, I would say that
if you're going to have a limited number of iteration constructs AND
fixed-based arrays, the iteration constructs should be chosen to reflect
the base (be it zero or one) of your arrays.  (To put it another way: If
your language is going to put you in a straightjacket, it should at
least be a comfortable one.)

>>  (do ((i start (1+ i)))
>>      ((= i (1+ end)))
>>    (frob (oref array i))) ; "Oref" for "One-based Aref"
>
>actually, we generally used (> i end) as our end test.

Yeah, the "system" code I write usually does that -- you can't trust
lusers not to pass you bad indices.  But my personal taste is that =
looks much nicer than >.  No biggie.

On moving on to the next subsequence:
>... in PL/I, at the end of a "do i = start to end" loop, the
>variable i is left with the value end+1.  So, it is common to continue
>with:
>
>	do i = i to new-end;...

Yes, it's nice when the value of a loop index variable is defined after
the loop.  Unfortunately, not all languages do that, presumably to give
the implementor more leeway in the name of "efficiency."  (If a compiler
unrolls the loop, for example, it would still have to assign something
to the iteration variable afterwards.)

>About the only time in PL/I where you have to add 1 is when going from
>loops like the above to calls to the SUBSTR builtin, which extracts a
>portion of a string.  This function takes start and length rather than
>start and end.

Is that the only builtin that uses lengths?  That seems like a kind of
odd wart on the language, if so.  If not, I'd argue as before that the
end-exclusive property Length = End - Start is something worth trying to
preserve.

>As I pointed out above, there's no real reason why end-inclusive
>subranges and 1-based arrays must be related.

Yeah, I agree, given that your iteration constructs are flexible enough,
or that you can roll your own.  I'll make the vague and unsupported
claim that people usually manipulate 1-based arrays using an
end-inclusive model, though.  The articles which prompted me to post
here seemed to be written by people thinking along those lines.

>... we humans tend to think 1-based.

Hmm.  I think I think zero-based.  I don't know if that comes from
having taken too many math courses, or from dealing with icky innards of
low-level software (computer memory is, after all, zero-based).  I guess
you could argue that it took thousands of years of counting things
before people discovered zero, so it can't be all THAT natural.  Maybe
one-based counting IS "natural," but that doesn't necessarily make it
"right."

--Skef
  (Preferred mail address: Wholey@C.CS.CMU.EDU)

turner@sdti.UUCP (Prescott K. Turner) (03/05/88)

As someone said, people do 1-based counting.  (Sorry, I can't find the
reference to quote.)  A child learns to count by pointing at the objects 
and reciting the numbers, as "one, two, three".  The counting process
naturally labels the objects in a one-based fashion.  Are any of you 
zero-based partisans teaching your children to recite "zero, one, two"
and then deduce the cardinality to be three?  

I have seen pros and cons to zero-based arrays, and don't really mind
them in a programming language.  What I do mind is when it spills over
into user interfaces and even human discourse.  When the cursor is blinking
in the first column of my screen, my text editor should report the column 
number as 1, not 0.  (I know, my text editor is wishing it could put
the cursor just to the left of the first column, not smack on top of the
first column, but when it resorts to the latter real problems result.)

And in Dijkstra's letter in the June 1987 CACM, he numbers his points and
his references starting with 0.  I find this annoying and wrong-headed.

Since computer programs must relate to humans at some point, it is
desirable to stick with the human convention that when labeling things
by counting, you start with 1.  And this applies to arrays.

--
Prescott K. Turner, Jr.
Software Development Technologies, Inc.
375 Dutton Rd., Sudbury, MA 01776 USA        (617) 443-5779
UUCP:necntc!necis!mrst!sdti!turner

g-rh@cca.CCA.COM (Richard Harter) (03/07/88)

Most of the fuss and confusion comes about because there are three
different kinds of 'numbers' involved: cardinals, ordinals, and displacement
measures.

A cardinal number is the measure of the size, e.g. this array has five
elements.

An ordinal number is the count sequence, e.g. this is the 1st element,
... this is the 5'th element.

A displacement measure is the distance between an element and the base,
e.g. this element is displaced four elements from the base.

Ordinals match cardinals, e.g. if a sequence has five elements, the last
ordinal corresponds to the cardinal number of the sequence.  There are
fundamental reasons for not mucking with the definition of ordinal, which
I won't go into.

Displacement measures are more useful than ordinals in programming.
The fundamental reason is that names in programming specify addresses,
i.e a[i] is the item located at address 'a[i]'.  Regardless of the language
array basing index, 'i' is used to determine a displacement from the
array address.

The fencepost problem is intrinsic -- in a sequence containing 'n' items
the displacement between the first and the last element in the sequence
is 'n-1'.  This fact is not altered by the array basing scheme used.

-- 

In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die.
	Richard Harter, SMDS  Inc.

faustus@ic.Berkeley.EDU (Wayne A. Christopher) (03/08/88)

If you're looking for mathematical justifications for 0-based or 1-based
arithmetic, remember that the usual definition of n in set theory is
{ n-1, n-2, ... 1, 0 }.  Thus by analogy an array of n elements contains
elements n-1, n-2, ... 1, 0.

Really it's a matter of taste, like hex vs octal.

	Wayne

eric@snark.UUCP (Eric S. Raymond) (03/09/88)

In article <213@sdti.UUCP> Prescott K. Turner, Jr. writes:
>And in Dijkstra's letter in the June 1987 CACM, he numbers his points and
>his references starting with 0.  I find this annoying and wrong-headed.

I agree that it's usually appropriate to have lists for humans start with 1,
and that Dijkstra was being a tad precious (this sort of behavior is an every
once in a while requirement for continued perks and status in the academic
bondage-and-discipline school of language design, doncha know).

However: there is one sort of list-for-humans where zero-basing is appropriate,
in my opinion -- a list that refers to itself. Consider the following examples:

------------------------------------------------------------------------------
This distribution package contains 3 files:

	0. This READ.ME

	1. t&a.txt -- phone numbers and vital statistics of the Dallas
	   Cowboys' cheerleading squad.

	2. hunk.txt -- phone numbers and vital statistics of the Dallas
	   Cowboys.
------------------------------------------------------------------------------

or

------------------------------------------------------------------------------
To install the superduper thingy metacompiler, do the following:

	0. Read this file carefully.

	1. Insert a peanut butter sandwich in your drive slot.

	2. Wrap a maypole with the distribution tape.

------------------------------------------------------------------------------

Also, sometimes a list of instructions will start with one that has sort of
a background prep or general-precondition character, as opposed to the
remainder which are specific well-defined actions. For example:

------------------------------------------------------------------------------
Instructions for going over Niagara Falls in a barrel:

	0. Make sure you really want to do this.

	1. Procure a barrel.

	2. Pad the insides generously.

		(etc.)
------------------------------------------------------------------------------

In these particular sorts of cases, I find zero-indexing a useful cue.

{Mild apologies for the silliness of the examples. I just stumbled out of bed
and haven't showered or breakfasted yet. And BTW I am *not* a football fan...}
-- 
      Eric S. Raymond                  (the mad mastermind of TMN-Netnews)
      UUCP:  {{seismo,ihnp4,rutgers}!cbmvax,sdcrdcf!burdvax,vu-vlsi}!snark!eric
      Post:  22 South Warren Avenue, Malvern, PA 19355    Phone: (215)-296-5718