[net.lang.c] Byte order

g-rh@cca.UUCP (Richard Harter) (04/07/86)

In article <> rbj%icst-cmr@smoke.UUCP writes:
>Bill Crews writes:
>
>	The only reason you got -28672 for BIG instead of nulls is
>	because your machine has backwards byte order.
>
>Sorry Bill, *you're* the one that's got backwards byte order. Little
>Endian is `correct', even tho bucking historical convention.
>
>My reasoning is this: The original byte ordering was done the obvious
>way, Big Endian. If this was so perfect, why would a sane man design
>anything Little Endian? For compelling mathematical reasons!
>You wouldn't number your bits backwards (within a register) would you?
>Admittedly, some people do, but they must not know any better.
>
	Well, no, little-endian came about because the engineers at DEC
who designed the PDP-11 made an arbitrary decision that was not well
thought out.  I will not essay to defend the sanity of DEC engineers,
and cannot recommend that any one else do so (:-)).  It was a bad
decision.

	Consider the following problem.  You have an array of 4 byte
integers.  If you sort the array numerically you get one result.  If
you regard the bytes as characters and sort them lexicographically on
a little endian machine you get a different result.  The reason is that
the most significant byte occupies the eight least significant bits.
Consistency of significance requires that the direction of significance
be the same for both bytes and bits.

	Little-endian came about from the idea of making the lsb of a
word be bit 0.  From this it follows that byte 0 should be the lsbyte
of the word.  This is in natural analogy with the idea that word 0
is the l.s. word.  The object is:

bit 0 is the lowest address bit
byte 0 is the lowest address byte
word 0 is the lowest address word

This is the "all addressing flows in the same direction" constraint.
If that were the only constraint either big-endian or little-endian
would be acceptable, provided that it were followed through consistently.
However we have a second constraint, that comparison be valid whether
a word is considered as a character string or as a number.  Given
both constraints, the correct addressing assignment is:

bit 0 is the most significant bit
byte 0 is the most significant byte.

In short, little-endian was a mistake, is a mistake, and will continue
to be a mistake.

		Richard Harter, SMDS Inc.

ggs@ulysses.UUCP (Griff Smith) (04/07/86)

...
> 	Well, no, little-endian came about because the engineers at DEC
> who designed the PDP-11 made an arbitrary decision that was not well
> thought out.  I will not essay to defend the sanity of DEC engineers,
> and cannot recommend that any one else do so (:-)).  It was a bad
> decision.
.
...
> In short, little-endian was a mistake, is a mistake, and will continue
> to be a mistake.
> 
> 		Richard Harter, SMDS Inc.

As an old PDP-11 hacker, I can't agree with the condemnation of the
DEC engineering decision.  You are looking at it from the perspective
of a modern software engineer who wouldn't think of type punning and
other hacks.  To an assembly language programmer, however, the ability
to use the same address to test the low and high bytes of a device
status register meant that code would be shorter and faster.  It also
increased the number of cases where indirect addressing could be used
with register pointers.  You can't expect the engineers to have
anticipated that high-level languages would discredit these practices.

My own theory about big vs. little end usage is that the mistake was
made hundreds of years ago when merchants started to adopt the Arabic
(as adapted from earlier Hindu sources) number system.  Note that
Arabic is written right to left; note that numbers are written right
to left.  I think the Arabs knew what they were doing; they set the
notation so that the natural computational order followed the
conventional lexical order.  The European merchants missed the point
and copied the notation verbatim instead of compensating for the
opposite lexical convention.

In summary, big-endian was a mistake, but there is no use fighting it.
Any better-informed historical challenges will be cheerfully accepted;
the best data I could get was from an expatriate of Iran.
-- 

Griff Smith	AT&T (Bell Laboratories), Murray Hill
Phone:		(201) 582-7736
Internet:	ggs@ulysses.uucp
UUCP:		ulysses!ggs  ( {allegra|ihnp4}!ulysses!ggs )

david@sun.uucp (David DiGiacomo) (04/07/86)

In article <7046@cca.UUCP> g-rh@cca.UUCP (Richard Harter) writes:
>	Well, no, little-endian came about because the engineers at DEC
>who designed the PDP-11 made an arbitrary decision that was not well
>thought out.  I will not essay to defend the sanity of DEC engineers,
>and cannot recommend that any one else do so (:-)).  It was a bad
>decision.

You are not considering the context of the decision.  A little-endian
architecture is highly desirable in an environment which uses
call-by-reference exclusively (e.g. PDP-11 Fortran).  It allows any size
value to be passed to a subroutine expecting a parameter of that size or
smaller.  In a predominantly call-by-value environment (e.g. C) this is
not particularly important, but there are plenty of programmers who have
been burned by { int c = 0; read(fd, &c, 1); }.

-- 
David DiGiacomo  {decvax, ihnp4, ucbvax}!sun!david  david@sun.com
Sun Microsystems, Mt. View, CA  (415) 960-7495

mj@myrias.UUCP (Michal Jaegermann) (04/08/86)

>>I think the Arabs knew what they were doing; they set the
>>notation so that the natural computational order followed the
>>conventional lexical order.  The European merchants missed the point
>>and copied the notation verbatim instead of compensating for the
>>opposite lexical convention.

We have right now A. D. MCMLXXXVI.  Somehow Romans scanned numbers also
in a "wrong" order. :-)  At least if you are writing from left-to-right.

Michal Jaegermann
Myrias Research Corporation
Edmonton, Alberta
...ihnp4!alberta!myrias!mj

markb@sdcrdcf.UUCP (Mark Biggar) (04/08/86)

In article <7046@cca.UUCP> g-rh@cca.UUCP (Richard Harter) writes:
>However we have a second constraint, that comparison be valid whether
>a word is considered as a character string or as a number.

Where in the world did this constraint come from?  This has to be one of
the most blatant violations of the spirit of data abstraction and information
hiding I have ever seen.  You don't require this if the types being
considered are signed/unsigned or int/float, so why should you if the types
are char-string/int?

Mark Biggar
{allegra,burdvax,cbosgd,hplabs,ihnp4,akgua,sdcsvax}!sdcrdcf!markb

gwyn@brl-smoke.ARPA (Doug Gwyn ) (04/08/86)

In article <7046@cca.UUCP> g-rh@cca.UUCP (Richard Harter) writes:
>	Well, no, little-endian came about because the engineers at DEC
>who designed the PDP-11 made an arbitrary decision that was not well
>thought out.  I will not essay to defend the sanity of DEC engineers,
>and cannot recommend that any one else do so (:-)).  It was a bad
>decision.

One advantage of DEC's decision was that passing numeric quantities
by reference to procedures was simplified; the procedure could use
the lower bits of the parameter without worrying about the size of
the parameter.  This made code generation (e.g. Fortran) easier and
was a real programmer convenience (I certainly exploited it myself).

When the FP11 was designed, DEC screwed up and got the 4-byte (long)
integer format wrong (since fixed in the VAX-11, which was otherwise
most careful to use PDP-11 compatible data formats).  Some Fortrans
kept the bytes in little-endian order anyway, so as not to lose
the above-mentioned advantage, and reshuffled them during FP11
operations.

Many implementors of C on big-endian machines have had to cope with
the pointer conversion problem, which is simpler on little-endians.

>	Consider the following problem.  You have an array of 4 byte
>integers.  If you sort the array numerically you get one result.  If
>you regard the bytes as characters and sort them lexicographically on
>a little endian machine you get a different result.  The reason is that
>the most significant byte occupies the eight least significant bits.
>Consistency of significance requires that the direction of significance
>be the same for both bytes and bits.

This is not a practical example.  If it were, you would have to
make the same argument for floating-point numbers, records from
a file, and so on.

>In short, little-endian was a mistake, is a mistake, and will continue
>to be a mistake.

About the only thing that is clear, apart from the better fit of
big-endian to block-oriented operations and of little-endian to
stream-oriented ones, is that this is a "religious issue".

yde@necis.UUCP (David Elins ext. 220) (04/09/86)

What is right.

Whether a lexicographic sort be the equal of a numeric,
or mayhaps the addresses of bytes within a word mimic their significance.

'Tis not ours to decide, we may only give our thanks that there be (so
far) only two ways of attaching significance to these beasties.

A good article on the "Byte Order" problem appeared in the April Fool's
day issue of Computer Language Magazine. The author (Mike Higgins) took
a very sensible stand on what is the "right" way to order bytes.

That issue of CL was much fun. The article on Un*x is particularly good;
it answered a lot of questions for me on why things are the way they are.

rbj@icst-cmr (Root Boy Jim) (04/09/86)

	>	The only reason you got -28672 for BIG instead of nulls is
	>	because your machine has backwards byte order.
	>
	>Sorry Bill, *you're* the one that's got backwards byte order. Little
	>Endian is `correct', even tho bucking historical convention.
	>
	>My reasoning is this: The original byte ordering was done the obvious
	>way, Big Endian. If this was so perfect, why would a sane man design
	>anything Little Endian? For compelling mathematical reasons!
	>You wouldn't number your bits backwards (within a register) would you?
	>Admittedly, some people do, but they must not know any better.
	>
		Well, no, little-endian came about because the engineers at DEC
	who designed the PDP-11 made an arbitrary decision that was not well
	thought out.  I will not essay to defend the sanity of DEC engineers,
	and cannot recommend that any one else do so (:-)).  It was a bad
	decision.

Like I said, `Compelling Mathematical Reasons'. Suppose you were to write
an infinite precision multiply routine. Each character represents a decimal
digit. The routine is called as `mult(x,sx,y,sy)'. The code becomes:

	char *mult(x, sx, y, sy) register char *x, *y; int sx, sy;
	{	extern char *calloc();	/* pre-ANSI declaration */
		register char *p; register int j, k;
		p = calloc(sx + sy, 1);	/* result digits, zeroed */
		for (j = 0; j < sx; j++)
			for (k = 0; k < sy; k++)
				if ((p[j + k] += x[j] * y[k]) >= 10)	/* carry */
					p[j + k + 1] += p[j + k] / 10,
					p[j + k    ] %= 10;
		return(p);
	}

If you did it Big Endian, you would have to go to the end of the `string'
first and do it backwards with all kinds of god-awful subtraxion expressions.
There are other examples.
	
		Consider the following problem.  You have an array of 4 byte
	integers.  If you sort the array numerically you get one result.  If
	you regard the bytes as characters and sort them lexicographically on
	a little endian machine you get a different result.  The reason is that
	the most significant byte occupies the eight least significant bits.
	Consistency of significance requires that the direction of significance
	be the same for both bytes and bits.
	
No. You sort character by character. Character Strings are defined as
Big Endian. Don't mix apples and oranges.

	Little-endian came about from the idea of making the lsb of a
	word be bit 0.  From this it follows that byte 0 should be the lsbyte
	of the word.  This is in natural analogy with the idea that word 0
	is the l.s. word.  The object is:
	
	bit 0 is the lowest address bit
	byte 0 is the lowest address byte
	word 0 is the lowest address word

You Got It!
	
	This is the "all addressing flows in the same direction" constraint.
	If that were the only constraint either big-endian or little-endian
	would be acceptable, provided that it were followed through consistently.

Perhaps now is a good time to address the issue of floating point. Several
machines, including the VAX, have the following formats:

		Byte 0		Bytes 1-3	Bytes 4-7
	Single:	sign+exp	3*mantissa	
	Double:	sign+exp	3*mantissa	4*mantissa

Allow me to coin the term `Middle Endian', which means that values are
stored with the bits closest to the binary point in lower memory.
This unifies the concept of floating points and integers. Don't even
mention `BCD' or `Packed Decimal', which are more like strings than numbers.

	However we have a second constraint, that comparison be valid whether
	a word is considered as a character string or as a number.  Given
	both constraints, the correct addressing assignment is:
	
	bit 0 is the most significant bit

Full Shucking Bit! This is a prime example of bogosity. I have to know
how many bits you've got in a register (or word, or whatever) to know
what bit 13 is. Bit n should represent `2 raised to the power n', not
`2 raised to the power (k - n)'. That's what happens when engineers
design hardware instead of mathematicians or computer scientists.

	byte 0 is the most significant byte.
	
	In short, little-endian was a mistake, is a mistake, and will continue
	to be a mistake.
	
By now, you have probably seen the posting on `type punning', which allows
inferior functions to manipulate larger call-by-reference arguments as
smaller quantities as long as they remain unsigned with respect to the
smaller quantity. Why do you think DEC defined the `Jump Low Bit' instruxion?
So they could use one test on any of LOGICAL*[124] variables. And TRUE == 1.

			Richard Harter, SMDS Inc.
	
In short, DEC's decision *was* well thought out. Like I said, if there
was no good reason to design Little Endian, it never would have been
designed. Oftentimes the obvious (and usually first) idea is wrong.
Consider Origin One Indexing vs Origin Zero Indexing, for example.

P.S. Does anybody remember the `Big Indian' pinball machine?
Does anybody remember pinball machines?

	(Root Boy) Jim Cottrell		<rbj@cmr>
	Baseball in D.C. in `87!

thomas@utah-gr.UUCP (Spencer W. Thomas) (04/09/86)

The best (and most tongue in cheek) discussion I have ever seen about
this issue is a USC/ISI memo written by Danny Cohen, titled "On Holy
Wars and a Plea for Peace".  The whole thing is about 33000 bytes, so I
won't post it, but I will give some excerpts.  [Well, even my excerpts
come to about 1/3 of the original.]

IEN 137                                                      Danny Cohen
                                                             U S C/I S I
                                                            1 April 1980


                   ON HOLY WARS AND A PLEA FOR PEACE



                              INTRODUCTION


This  is  an  attempt to stop a war.  I hope it is not too late and that
somehow, magically perhaps, peace will prevail again.

The latecomers into the arena believe that the issue is:  "What  is  the
proper byte order in messages?".

The root of the conflict lies much deeper than that.  It is the question
of  which  bit  should  travel first, the bit from the little end of the
word, or the bit from the big end of the  word?  The  followers  of  the
former  approach are called the Little-Endians, and the followers of the
latter are called the Big-Endians.  The details of the holy war  between
the  Little-Endians  and  the  Big-Endians  are  documented  in  [6] and
described, in brief, in the Appendix. I recommend that you  read  it  at
this point.
...
In a consistent order, the bit-order, the  byte-order,  the  word-order,
the  page-order, and all the other higher level orders are all the same.
Hence, when considering a serial bit-stream, along a communication  line
for example, the "chunk" size which the originator of that stream has in
mind is not important.

There  are  two  possible  consistent  orders.  One is starting with the
narrow end of each  word  (aka  "LSB")  as  the  Little-Endians  do,  or
starting with the wide end (aka "MSB") as their rivals, the Big-Endians,
do.

In  this note we usually use the following sample numbers: a "word" is a
32-bit quantity and is designated by a "W", and a  "byte"  is  an  8-bit
quantity  which  is  designated  by  a  "C"  (for "Character", not to be
confused with "B" for "Bit").




                              MEMORY ORDER

The first  word  in  memory  is  designated  as  W0,  by  both  regimes.
Unfortunately, the harmony goes no further.

The Little-Endians assign B0 to the LSB of the words and B31 is the MSB.
The Big-Endians do just the opposite, B0 is the MSB and B31 is the LSB.

By  the  way,  if  mathematicians had their way, every sequence would be
numbered from ZERO up, not from ONE, as is traditionally done.   If  so,
the first item would be called the "zeroth"....

Since  most  computers  are not built by mathematicians, it is no wonder
that some computers designate  bits  from  B1  to  B32,  in  either  the
Little-Endians'  or the Big-Endians' order.  These people probably would
like to number their words from W1 up, just to be consistent.

...
On  the  other  hand,  the  Little-Endians  have  their  view,  which is
different but also self-consistent.

They believe that one should start with the narrow end  of  every  word,
and  that  low  addresses  are  of  lower  order  than  high  addresses.
Therefore they put their words on paper  as  if  they  were  written  in
Hebrew, like this:

                   ...|---word2---|---word1---|---word0---|

When they add the bit order and the byte order they get:

                   ...|---word2---|---word1---|---word0---|
                  ....|C3,C2,C1,C0|C3,C2,C1,C0|C3,C2,C1,C0|
                 .....|B31......B0|B31......B0|B31......B0|

In  this regime, when word W(n) is shifted right, its LSB moves into the
MSB of word W(n-1).

English  text  strings  are  stored  in  the  same order, with the first
character in C0 of W0, the next in C1 of W0, and so on.

This order is very consistent with itself, with the Hebrew language, and
(more importantly) with mathematics, because significance increases with
increasing item numbers (address).

It has the disadvantage that English  character  streams  appear  to  be
written backwards; this is only an aesthetic problem but, admittedly, it
looks funny, especially to speakers of English.

In  order  to  avoid  receiving  strange  comments about this order, the
Little-Endians pretend that they are Chinese, and write the  bytes,  not
right-to-left but top-to-bottom, like:

                        C0: "J"
                        C1: "O"
                        C2: "H"
                        C3: "N"
                        ..etc..

... (Discussion of PDP-11 Floating point unit leads into ...)

However, due to some oversights in the security screening  process,  the
Blefuscuians  took  over,  again.  They assigned, as they always do, the
wide end to the LOWer addresses in memory, and the narrow to the  HIGHer
addresses.

Let   "xy"   and  "abcd"  be  32-  and  64-bit  floating-point  numbers,
respectively.  Let's look how these numbers are stored in memory:

          ddddddddL ccccccccc bbbbbbbbb SMaaaaaaa yyyyyyyyL SMxxxxxxx
     ....|--word5--|--word4--|--word3--|--word2--|--word1--|--word0--|
    .....|-C1-|-C0-|-C1-|-C0-|-C1-|-C0-|-C1-|-C0-|-C1-|-C0-|-C1-|-C0-|
   ......|B15....B0|B15....B0|B15....B0|B15....B0|B15....B0|B15....B0|

Well, Blefuscu scores many points for this. The above reference  in  [3]
does not even try to camouflage it by any Chinese notation.

Encouraged by this success, as minor as it is, the Blefuscuians tried to
pull  another fast one.  This time it was on the VAX, the sacred machine
which all the Little-Endians worship.

Let's look at the VAX order. Again, we look at the way  the  above  data
(with xy being a 32-bit integer) is stored in memory:

               "N" "H"   "O" "J"  SMzzzzzzL SMxxxxxxx yyyyyyyyL
          ...ng2-------|-------long1-------|-------long0-------|
         ....|--word4--|--word3--|--word2--|--word1--|--word0--|
        .....|-C1-|-C0-|-C1-|-C0-|-C1-|-C0-|-C1-|-C0-|-C1-|-C0-|
       ......|B15....B0|B15....B0|B15....B0|B15....B0|B15....B0|

What a beautifully consistent Little-Endians' order this is !!!

So,  what  about  the infiltrators? Did they completely fail in carrying
out their mission?  Since the integer  arithmetic  was  closely  guarded
they  attacked  the  floating  point  and the double-floating which were
already known to be easy prey.

Let's  look, again, at the way the above data is stored, except that now
the 32-bit quantity xy is a floating point  number:  now  this  data  is
organized in memory in the following Blefuscuian way:

               "N" "H"   "O" "J"  SMzzzzzzL yyyyyyyyL SMxxxxxxx
          ...ng2-------|-------long1-------|-------long0-------|
         ....|--word4--|--word3--|--word2--|--word1--|--word0--|
        .....|-C1-|-C0-|-C1-|-C0-|-C1-|-C0-|-C1-|-C0-|-C1-|-C0-|
       ......|B15....B0|B15....B0|B15....B0|B15....B0|B15....B0|

Blefuscu  scores  again.    The  VAX  is  found guilty, however with the
explanation that it tries to be compatible with the PDP11.

Having found themselves there, the  VAXians  found  a  way  around  this
unaesthetic   appearance:  the  VAX  literature  (e.g.,  p. 10  of  [4])
describes this order by using the Chinese top-to-bottom notation, rather
than an embarrassing left-to-right or right-to-left one.  This page is a
marvel.  One has to admire the skillful way in which some quantities are
shown in columns 8-bit wide, some in 16 and others in 32, all in order to
avoid the egg-on-the-face problem.....

By the way, some engineering-type people complain  about  the  "Chinese"
(vertical)  notation  because usually the top (aka "up") of the diagrams
corresponds to "low"-memory (low addresses).  However,  anyone  who  was
brought  up by computer scientists, rather than by botanists, knows that
trees grow downward, having their roots at the top of the page and their
leaves down below. Computer scientists seldom remember  which  way  "up"
really is (see 2.3 of [5], pp. 305-309).

...
                 SUMMARY (of the Memory-Order section)


To  the best of my knowledge only the Big-Endians of Blefuscu have built
systems with a consistent order  which  works  across  chunk-boundaries,
registers,   instructions   and   memories.      I   failed  to  find  a
Little-Endians' system which is totally consistent.

...  (Discussion in similar vein of various transmission protocols)

              SUMMARY (of the Transmission-Order section)


There  are two camps each with its own language.  These languages are as
compatible with each other as any Semitic and Latin languages are.

All Big-Endians can talk to each other with relative ease.

So can all the Little-Endians, even though there  are  some  differences
among the dialects used by different tribes.

There is no middle ground. Only one end can go first.




                               CONCLUSION


Each  camp  tries  to convert the other.  Like all the religious wars of
the past, logic is not the decisive tool. Power is.  This  holy  war  is
not the first one, and probably will not be the last one either.

The  "Be reasonable, do it my way" approach does not work.  Neither does
the Esperanto approach of "let's all switch to yet a new language".

Our communication world may split according to the  language  used.    A
certain  book  (which  is  NOT  mentioned in the references list) has an
interesting story about a similar phenomenon, the Tower of Babel.

Little-Endians are Little-Endians and Big-Endians  are  Big-Endians  and
never the twain shall meet.

We  would like to see some Gulliver standing up between the two islands,
forcing a unified communication regime on all of us.  I do hope that  my
way  will  be chosen, but I believe that, after all, which way is chosen
does not make too much difference.  It is more important to  agree  upon
an order than which order is agreed upon.

How about tossing a coin ???
...


    For ease of reference please note that Lilliput  and  Little-Endians
    both start with an "L", and that both Blefuscu and Big-Endians start
    with a "B".  This is handy while reading this note.

                          R E F E R E N C E S



[1]   Bolt Beranek & Newman.
      Report No. 1822: Interface Message Processor.
      Technical Report, BB&N, May, 1978.

[2]   CCITT.
      Orange Book. Volume VIII.2:  Public Data Networks.
      International Telecommunication Union, Geneva, 1977.

[3]   DEC.
      PDP11 04/05/10/35/40/45 processor handbook.
      Digital Equipment Corp., 1975.

[4]   DEC.
      VAX11 - Architecture Handbook.
      Digital Equipment Corp., 1979.

[5]   Knuth, D. E.
      The Art of Computer Programming. Volume I:  Fundamental
         Algorithms.
      Addison-Wesley, 1968.

[6]   Swift, Jonathan.
      Gulliver's Travels.
      Unknown publisher, 1726.


               OTHER SLIGHTLY RELATED TOPICS (IF AT ALL)

Who's on first?   Zero or One ??

People  start  counting  from  the  number  ONE.  The very word FIRST is
abbreviated into the symbol "1st" which indicates ONE,  but  this  is  a
very modern notation.  The older notations do not necessarily support this
relationship.

In  English  and French, the word "first" is not derived from the word
"one" but from an  old  word  for  "prince"  (which  means  "foremost").
Similarly,  the  English  word  "second"  is not derived from the number
"two" but from an old word which means "to follow".  Obviously there is
a  close  relation between "third" and "three", "fourth" and "four" and
so on.

Similarly, in Hebrew, for example, the word "first" is derived from  the
word  "head",  meaning  "the foremost", but not specifically No. 1.  The
Hebrew word for "second" is specifically derived from  the  word  "two".
The same for three, four and all the other numbers.

...

SWIFT's POINT

It may be interesting to notice that  the  point  which  Jonathan  Swift
tried  to  convey  in  Gulliver's Travels is exactly the opposite of the
point of this note.

Swift's point is that the difference between breaking  the  egg  at  the
little-end  and  breaking  it  at the big-end is trivial.  Therefore, he
suggests that everyone do it in his own preferred way.

We agree that the difference between sending eggs with  the  little-  or
the  big-end first is trivial, but we insist that everyone must do it in
the same way, to avoid anarchy.  Since the difference is trivial we  may
choose either way, but a decision must be made.


-- 
=Spencer   ({ihnp4,decvax}!utah-cs!thomas, thomas@utah-cs.ARPA)

g-rh@cca.UUCP (Richard Harter) (04/10/86)

In article <> ggs@ulysses.UUCP (Griff Smith) writes:
>...
>> 	Well, no, little-endian came about because the engineers at DEC
>> who designed the PDP-11 made an arbitrary decision that was not well
>> thought out.  I will not essay to defend the sanity of DEC engineers,
>> and cannot recommend that any one else do so (:-)).  It was a bad
>> decision.
>.
>...
>> In short, little-endian was a mistake, is a mistake, and will continue
>> to be a mistake.
>> 
>> 		Richard Harter, SMDS Inc.
>
>As an old PDP-11 hacker, I can't agree with the condemnation of the
>DEC engineering decision.  You are looking at it from the perspective
>of a modern software engineer who wouldn't think of type punning and
>other hacks.  To an assembly language programmer, however, the ability
>to use the same address to test the low and high bytes of a device
>status register meant that code would be shorter and faster.  It also
>increased the number of cases where indirect addressing could be used
>with register pointers.  You can't expect the engineers to have
>anticipated that high-level languages would discredit these practices.

	Ah, but I too am an old PDP-11 hacker.  (In fact, my first
	DEC machine was a PDP-1!)  I've done all those good things
	you talk about -- however you could do exactly the same
	things in a correctly designed big-endian machine.  The
	issues at hand have nothing to do with modern software
	engineering and high-level languages.  See below.

>
>My own theory about big vs. little end usage is that the mistake was
>made hundreds of years ago when merchants started to adopt the Arabic
>(as adapted from earlier Hindu sources) number system.  Note that
>Arabic is written right to left; note that numbers are written right
>to left.  I think the Arabs knew what they were doing; they set the
>notation so that the natural computational order followed the
>conventional lexical order.  The European merchants missed the point
>and copied the notation verbatim instead of compensating for the
>opposite lexical convention.
>
>In summary, big-endian was a mistake, but there is no use fighting it.
>Any better-informed historical challenges will be cheerfully accepted;
>the best data I could get was from an expatriate of Iran.

	Being that we (you and I and a select few others) are all reasonable
beings, let us all eschew slogans and epithets (especially me) and reason
together.  Perhaps we can find some truth.  Let us see.

	Discussions of little-endian vs big-endian are often muddied by
two collateral issues, the merits of coherent addressing, and the merits
of byte addressing.  Coherent addressing is a neologism I have invented
for this discussion.  All it really means is that all addressing goes in
the same direction.  Thus bit 0 of byte 0 of a word is bit 0 of the word,
etc.  The diagram below illustrates coherent addressing:

Byte:	0 1 2 3 4 5 6 7 8 ....
Int*2:	0   1   2   3   4 ....
Int*4:	0       1       2 ....

Coherent addressing is clearly desirable.  However coherent addressing
has nothing to do, per se, with little-endian or big-endian.  A common
point of confusion in these arguments is to argue for the advantages
of coherent addressing in the belief that one is thereby arguing for
one's favorite position.

	The PDP-11 uses uniform byte addressing.  The advantage of
uniform byte addressing is that the addresses are independent of the
size of the entity being addressed.  The disadvantage of uniform byte
addressing is that it consumes address bits -- one for shorts, two for
longs, and more if we extend the scheme to larger blocks.  This is not
critical in present architectures; it would be if we dropped to the
level of uniform bit addressing.  Again, the merits of little-endian
versus big-endian have nothing to do with the merits of uniform byte
addressing.

	Big-endian versus little-endian only arises when we decide
which bit (byte) of a word is byte 0 -- the most significant byte or
the least significant byte.  Either choice will do for coherent
addressing.  However the choice does affect two areas, arithmetic
and comparison.  Let us consider the problems of doing arithmetic
on integers of indefinite size.  In that case the natural method
is to represent integers as polynomials in powers of two and do
the arithmetic starting with the lsb and work up (the point is
that the algorithms do not depend on knowing the location of the
msb's of the operands.)  In short, little-endian is the correct
choice for doing arithmetic on integers of indefinite size.

	For comparisons of strings of indefinite size, on the
other hand, the correct choice is big-endian.  The key point is
that the natural method for comparison is to first compare msb's
and work down.  It turns out, if one thinks about it, that the
natural model for strings is the binary fraction rather than
binary polynomial.

	The advantages of little-endian probably show up in
hardware design at the microcode level even though the machine
instructions for arithmetic operate on fixed size operands.  If
this is the case then this might have been a design factor when
the PDP-11 was first being designed.  (Cheap small computers
were a LOT slower in those days, and every trick to gain speed
and simplicity at the hardware level counted.)  At the machine
code level, however, all arithmetic instructions are fixed length
so little-endian/big-endian is, again, irrelevant.  In general
programming, indefinite length arithmetic is a rare animal.

	Comparison of strings, either bits or characters, is
ubiquitous.  And it is here that big-endian is preferable.
In view of this fact I conclude again, little-endian was a
mistake, albeit less of one than I made it out to be.

	Some side issues:  We write left to right because most
people are right handed.  Script style matters here -- if we
are drawing characters the direction doesn't matter.  In
choosing the representation of numbers the issues are the same
as they are for byte ordering.   If size is the issue, big-endian
has the advantage; if arithmetic is the issue, little-endian has
the advantage.

	As a personal note, when I was young I drilled myself
on fast multiplication.  I could write down two numbers and
then write down their product underneath at hand writing speed.
When I was in practice I could do products of five digit numbers
readily.  The trick isn't hard -- you just do cross multiplication.
However you have to do it little-endian style, i.e. work from the
low end up.  

		Richard Harter, SMDS Inc.

greg@utcsri.UUCP (Gregory Smith) (04/10/86)

>> In short, little-endian was a mistake, is a mistake, and will continue

>In summary, big-endian was a mistake, but there is no use fighting it.

bleah... bleah... bleah...

I have only been reading this net.group since Jan86, but I can't
believe this issue hasn't been discussed. My opinion: There ain't
no bad one and good one, just different ones...

Adding/subtracting multi-precision integers:
	Little-end is better because you start at the little end.

Comparing multi-prec. numbers:
	Big-end is better because you start at the big end.

Working with different-sized multi-prec numbers.
	Little-end is better because both numbers start at "1's".
	In Big-End, the first bytes have different weights, so you can't
	even compare them directly.
	LittleEnd also allows, say, an int to be copied to a char by reading
	the byte at the *same* address as the int.

Testing multi-prec ints for >=0/<0
	Big-end wins. just test the first byte.

Testing mult-prec ints for odd/even.
	Little-end wins. just test the first byte. :-)

Working with character strings:
	Big-end is better because lexical comparison ( first char most
	significant ) is equivalent to numerical comparison of mult-prec ints.

Transmitting serial async data:
	Little-end (bitwise) is essential if 7-bit/8-bit formats are
	to be used, ( and abused :-)), since that way the
	LSB location is independent of the # of bits. Only the
	8th bit is in doubt. This also affects CRC calculation:
	if you write software to find a CRC for data
	which is to be serially transmitted,( async or otherwise )
	you should know whether it is to be sent little-end or big-end,
	or burst error detection will suffer. I don't know which is
	easier or if there is a difference.

Human readability:
	Big-end is supposedly better 'cause that's how we write. I feel
	that this is completely irrelevant and is rather like saying that all
	floating-point calculations should be done in ASCII. Software
	to convert back and forth from ascii to binary is not affected much
	by the order. And if God had wanted humans to spend a lot of time
	reading hex dumps, he would have given us 16 fingers :-). So saying
	'Well little-endian looks funny on a dump' makes no sense.

Bit numbering within bytes:
	Little-end wins, LSB = bit 0 means that bit n = 2^n regardless of
	word width. This argument ( like the parallel one at the byte & word
	level) doesn't carry much weight by itself - practical situations
	like I have given re adding and comparing are more significant.
	However, numbering the LSB 0 is ( almost) universal.
	Ahhh... say the big-endians, look at two consecutive bytes:

	bit #'s:	7 6 5 4 3 2 1 0    7 6 5 4 3 2 1 0
			MSB	    LSB    MSB         LSB
	seen as an int:	............LSB	   MSB............ (little-end int)
	bit #'s:	7 6 5 4 3 2 1 0    15 14 13 12 11 10 9 8

	The int bit #'s are out of sequence, says the bigendian. That's
	because the big-endian ( like the little-endian) has written
	the bytes down with the MSB on the left. Just like a human. If you
	write them down the other way, the LSB of the int is on the left,
	and the bottom line reads 0,1,2...14,15. How nice. Remember, in
	many computers, there is nothing joining the msb of one byte to
	the lsb of the next, or vice-versa, except the byte order of ints
	and longer types.
	Some machines ( e.g. some of the 680XX? ) can 'peel' an arbitrary
	string of bits from memory. This should be consistent - i.e.
	peeling off 16 bits that happen to align with two bytes should
	give the same result as just reading those two bytes as an int.

I could come up with more, but it's late, and you don't want to read
them.  In fact, I'm a little-end fan, but only just - it doesn't bother
me a bit programming 6809 or 68K in assembler. The point is, whatever
is done, it must be CONSISTENT. Thus I don't like the 1,0,2,3 'hybrid'
byte ordering of longs on a PDP11*. And ( getting obscure ) the BDS-C
(CP/M 80 ) compiler implements longs in a big-end fashion while ints are
little-end - also nasty.

*not in the least bit [no pun] a result of the architecture of the PDP11, BTW.

So please no more postings saying "X is completely stupid and I wouldn't
even be seen in the same building with a machine that orders its bytes
that way". Considered arguments are, as always, welcome.

P.S. Someone said that 'Little-end is a result of the PDP-11'. I don't know,
but I find that hard to believe... except of course in the case of the VAX.
Maybe that's all that was meant. Anybody know? I do know that the following
micros are little-end: 8080,Z80,6502. E-mail please.

-- 
"If you aren't making any mistakes, you aren't doing anything".
----------------------------------------------------------------------
Greg Smith     University of Toronto      UUCP: ..utzoo!utcsri!greg

ggs@ulysses.UUCP (Griff Smith) (04/10/86)

> >>I think the Arabs knew what they were doing; they set the
> >>notation so that the natural computational order followed the
> >>conventional lexical order.  The European merchants missed the point
> >>and copied the notation verbatim instead of compensating for the
> >>opposite lexical convention.
> 
> We have right now A. D. MCMLIIIVI.  Somehow Romans scanned numbers also
> in a "wrong" order. :-)  At least if you are writing from left-to-right.
> 
> Michal Jaegermann
> Myrias Research Corporation
> Edmonton, Alberta
> ...ihnp4!alberta!myrias!mj

Try doing arithmetic with roman numerals!  My impression is that the
Arabic notation was adopted (with much discussion about its sanity,
morality, etc) after merchants conceded that Roman numerals were too
clumsy for business computations.  Considering how the Romans blew the
notation, I wouldn't give them much credit for picking the right digit
order.  Given the entrenched big-endian order in most of the spoken
languages, the merchants may have hesitated to adjust the notation for
fear that it would be hard to speak the numbers (though German numbers
up to 100 are little-endian!).  More likely, they just adopted the
notation verbatim without thinking much about it.
-- 

Griff Smith	AT&T (Bell Laboratories), Murray Hill
Phone:		(201) 582-7736
Internet:	ggs@ulysses.uucp
UUCP:		ulysses!ggs  ( {allegra|ihnp4}!ulysses!ggs )

henry@utzoo.UUCP (Henry Spencer) (04/10/86)

> 	Consider the following problem.  You have an array of 4 byte
> integers.  If you sort the array numerically you get one result.  If
> you regard the bytes as characters and sort them lexicographically on
> a little endian machine you get a different result.  The reason is that
> the most significant byte occupies the eight least significant bits.
> Consistency of significance requires that the direction of significance
> be the same for both bytes and bits.

If your machine has unsigned characters and two's-complement integers with
the sign bit in the high bit -- e.g. 360/370 -- then neither byte order
will make the two comparisons come out the same.  So quite apart from all
the other flaws in this argument, it just doesn't work.

Before you start defending signed characters, remember that they were
another arbitrary invention of the pdp11.  And a mistake, too.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry

g-rh@cca.UUCP (Richard Harter) (04/11/86)

In article <> gwyn@brl.ARPA writes:
>
>One advantage of DEC's decision was that passing numeric quantities
>by reference to procedures was simplified; the procedure could use
>the lower bits of the parameter without worrying about the size of
>the parameter.  This made code generation (e.g. Fortran) easier and
>was a real programmer convenience (I certainly exploited it myself).
>
	Point well taken.  I think that most of us will agree that
	it is usually bad practice on the part of the programmers
	to have mismatched types across procedure invocations.
	However code generating programs should exploit all of the
	tricks of the trade.
>
>Many implementors of C on big-endian machines have had to cope with
>the pointer conversion problem, which is simpler on little-endians.
>
	Color me skeptical.  You may be right, but I would like to
	see more details.  In the model of pointers that it occurs
	to me to implement, big-Endian and little-endian come out
	the same.
>
>>	Consider the following problem.  You have an array of 4 byte
>>integers.  If you sort the array numerically you get one result.  If
>>you regard the bytes as characters and sort them lexicographically on
>>a little endian machine you get a different result.  The reason is that
>>the most significant byte occupies the eight least significant bits.
>>Consistency of significance requires that the direction of significance
>>be the same for both bytes and bits.
>
>This is not a practical example.  If it were, you would have to
>make the same argument for floating-point numbers, records from
>a file, and so on.
>
	Grumble, grumble.  Sorry, this is a real practical problem.
	Let me give some context.  Suppose you are doing a high speed
	sort of character strings.  Being a clever fellow you have
	read the glossed over parts of the literature and have seen
	that a math sort is O(n) rather than O(n log n).  You also
	notice that it is cheaper to compare 4 bytes at once with
	a 32-bit integer compare than to make 4 byte compares.
	On a BE machine you make the calculations directly; on a
	LE machine you move the bytes into BE order and then do
	the compare.  Yes, Virginia, this is type punning, and all
	those other bad things your CS professor told you were
	no-no's.  But it pays off because you are exploiting the
	effective parallelism of integer address calculation.

>>In short, little-endian was a mistake, is a mistake, and will continue
>>to be a mistake.
>
>About the only thing that is clear, apart from the better fit of
>big-endian to block-oriented operations and of little-endian to
>stream-oriented ones, is that this is a "religious issue".

	Well, I won't exactly retract my words, but there is much
	to what you say.  I will contend that it was a mistake to
	introduce a second standard architecture where only one
	existed before unless the advantages of the second standard
	are clear-cut.  And this is not the case.  LE wins in some
	situations, BE wins in others.  But the disadvantages of
	two standards to deal with outweigh the advantages of either.

			Richard Harter, SMDS Inc.

crowl@rochester.ARPA (Lawrence Crowl) (04/12/86)

Neither big-endian nor little-endian numbers can be sorted lexicographically.
Since the only reason for attempting to mix character string and number
data is to save time, we must assume that we will have an arbitrary mix of
characters and numbers, including different size numbers.

When sorted lexicographically, the numbers in little-endian format are
sorted based on the low order digit.  For example 32 > 41.  CLEARLY WRONG.

When sorted lexicographically, the numbers in big-endian format are sorted
based on the first high order digit of each number.  This presents a problem
when numbers are of different sizes.  For example, 32 > 126.  CLEARLY WRONG.

In both cases, using two's complement sign representation causes problems.
For example, both have -41 > 41.  CLEARLY WRONG.

The basic problem is that our definition of lexicographic sorting is
incompatible with our (natural) definition of numeric sorting.

CHALLENGE:  Come up with a scheme for representing numbers, and a sorting
scheme in which numbers sort naturally.  Your scheme must deal with variable
length character strings and variable size numbers.  That is, you cannot
require strings to be padded with nulls, or numbers to be padded with zeros.
Note that many part numbering schemes have intermixed letters and digits.
-- 

Lawrence Crowl             716-275-5766        University of Rochester
                                               Computer Science Department
...!{allegra,decvax,seismo}!rochester!crowl    Rochester, New York,  14627

buls@dataioDataio.UUCP (Rick Buls) (04/15/86)

In article <235@myrias.UUCP> mj@myrias.UUCP (Michal Jaegermann) writes:
>
>We have right now A. D. MCMLIIIVI.  Somehow Romans scanned numbers also
syntax error                 ^^^

>in a "wrong" order. :-)  At least if you are writing from left-to-right.

-- 

				Rick Buls (Data I/O; Redmond, Wa)
				uw-beaver!entropy!dataio!buls

robison@uiucdcsb.CS.UIUC.EDU (04/16/86)

> CHALLENGE:  Come up with a scheme for representing numbers, and a sorting
> scheme in which numbers sort naturally.  Your scheme must deal with variable
> length character strings and variable size numbers.  That is, you cannot
> require strings to be padded with nulls, or numbers to be padded with zeros.

...
-4 = AAAAZ
-3 = AAAZ
-2 = AAZ
-1 = AZ 
 0 = M 
 1 = Z
 2 = ZZ
 3 = ZZZ
 4 = ZZZZ
...

Arch D. Robison
University of Illinois at Urbana-Champaign

thomas@kuling.UUCP (Thomas H{meenaho) (04/16/86)

In article <7046@cca.UUCP> g-rh@cca.UUCP (Richard Harter) writes:

>	Consider the following problem.  You have an array of 4 byte
>integers.  If you sort the array numerically you get one result.  If
>you regard the bytes as characters and sort them lexicographically on
>a little endian machine you get a different result.  The reason is that
>the most significant byte occupies the eight least significant bits.
>Consistency of significance requires that the direction of significance
>be the same for both bytes and bits.
>

	[Deleted text]

>In short, little-endian was a mistake, is a mistake, and will continue
>to be a mistake.
>
>		Richard Harter, SMDS Inc.


I think big-endian was and is a mistake. The only advantage with it is
that it's easy to read a hex dump!

I don't think people usually compare text strings word-wise so that is a
pretty weak argument.

What I find more significant is when you type-cast different types of
integers into other sized integers.
If for instance you have a long and want to convert it to a short, you just
have to read it as a short on a little-endian whereas on a big-endian you
have to adjust the address before you can read it.

I also think Motorola would have made the 68k a little-endian if they were
to design it from scratch knowing what they know today. The 68020 is a
pretty hefty processor but the dynamic bus sizing stinks. It can dynamically
(on a cycle by cycle basis) adjust to data busses varying from 8 to 32 bits.
But as the processor always assumes it will get a 32 bit word when it tries
to read one you'll have to put the data on the top bits if you have a 
narrower bus as it is the most significant part that comes first! 
If you give the processor 8 bits you'll have to put it on D24-D31. 
And for 16 bits on D16-D31.
-- 
Thomas Hameenaho, Dept. of Computer Science, Uppsala University, Sweden
Phone: +46 18 138650
UUCP: thomas@kuling.UUCP (...!{seismo,mcvax}!enea!kuling!thomas)

greg@utcsri.UUCP (Gregory Smith) (04/16/86)

In article <17162@rochester.ARPA> crowl@rochester.UUCP (Lawrence Crowl) writes:
>
>When sorted lexicographically, the numbers in little-endian format are
>sorted based on the low order digit.  For example 32 > 41.  CLEARLY WRONG.
>
>When sorted lexicographically, the numbers in big-endian format are sorted
>based on the first high order digit of each number.  This presents a problem
>when numbers are of different sizes.  For example, 32 > 126.  CLEARLY WRONG.
>incompatible with our (natural) definition of numeric sorting.
>
>CHALLENGE:  Come up with a scheme for representing numbers, and a sorting
>scheme in which numbers sort naturally.  Your scheme must deal with variable
>length character strings and variable size numbers.  That is, you cannot
>require strings to be padded with nulls, or numbers to be padded with zeros.

How about prepending the digit count to big-endian digit strings?
so 32 and 126 become 232 and 3126, and a lexical comparison gives
3126 > 232. Of course, leading zeroes in the significand cannot be used.
If the strings are of different lengths ( and both positive ) then the
longer one is always greater. To support negative numbers, the length
digit could be biased. The following convention supports signed strings
of length 1-5: 

	1  2  3  4  5		< length of number
	5  6  7  8  9		< prefix for +ve numbers
	4  3  2  1  0		< prefix for -ve numbers

The significand digits in a negative number would have to be 9's complemented,
so 32, 126,-37 and -8 are 632, 7126, 362 and 41. There are two 0's, 49 and 50.
( This makes zero an exception to the leading-zero rule ). Note also that
-99 becomes 300, and that 2999 is illegal.
This is starting to resemble a floating-point scheme...
To allow longer strings, just allow a bigger 'length' field - more digits,
or a whole byte.

>Note that many part numbering schemes have intermixed letters and digits.

But what significance do letters have? Do you want them to be ignored,
or to be assigned a dictionary order with the numbers?

P.S. this rather obscure scenario hardly seems reason to state that
little- and big-endians are both wrong.
-- 
"If you aren't making any mistakes, you aren't doing anything".
----------------------------------------------------------------------
Greg Smith     University of Toronto      UUCP: ..utzoo!utcsri!greg

friesen@psivax.UUCP (Stanley Friesen) (04/16/86)

In article <7158@cca.UUCP> g-rh@cca.UUCP (Richard Harter) writes:
>>
>>Many implementors of C on big-endian machines have had to cope with
>>the pointer conversion problem, which is simpler on little-endians.
>>
>	Color me skeptical.  You may be right, but I would like to
>	see more details.  In the model of pointers that it occurs
>	to me to implement, big-Endian and little-endian come out
>	the same.

	Could you expand on this! Do you mean that if you cast a
pointer to a long to a pointer to a short and dereference you will get
the *high* order portion on a big-endian machine and the *low* order
portion on a little-endian? Clearly a portability problem, and the
big-endian behavior is counter-intuitive. Or do you intend that the
pointer always point to the low-order byte even on a big-endian
machine? Then you have to index *backwards* to break the item up into
bytes! Really, the only way to get the right semantics on a big-endian
machine is to actually convert pointers when a cast is used.
-- 

				Sarima (Stanley Friesen)

UUCP: {ttidca|ihnp4|sdcrdcf|quad1|nrcvax|bellcore|logico}!psivax!friesen
ARPA: ttidca!psivax!friesen@rand-unix.arpa

crowl@rochester.ARPA (Lawrence Crowl) (04/17/86)

>I wrote:
>> CHALLENGE:  Come up with a scheme for representing numbers, and a sorting
>> scheme in which numbers sort naturally.  Your scheme must deal with variable
>> length character strings and variable size numbers.  That is, you cannot
>> require strings to be padded with nulls, or numbers to be padded with zeros.

Arch D. Robison wrote:
>...       -4 = AAAAZ   -3 = AAAZ   -2 = AAZ   -1 = AZ 
>  0 = M    1 = Z        2 = ZZ      3 = ZZZ    4 = ZZZZ ...
>

An interesting solution.  I totally failed to consider unary notations.  Just
to keep the answers honest, I will add the requirement that some efficient
form of representation be used.  Radix notation is efficient, but you are not
required to use it.  For extra credit, extend your notation to deal with
non-integral values.  This is easy under current notation if the integral
part is the same.
-- 

Lawrence Crowl             716-275-5766        University of Rochester
                                               Computer Science Department
...!{allegra,decvax,seismo}!rochester!crowl    Rochester, New York,  14627

jjhnsn@ut-ngp.UUCP (James Johnson) (04/17/86)

> 	Consider the following problem.  You have an array of 4 byte
> integers.  If you sort the array numerically you get one result.  If
> you regard the bytes as characters and sort them lexicographically on
> a little endian machine you get a different result.

WOW! You mean big endian machines are optimized to sort lists of
four-letter-words ? :-)
--
James Lee Johnson, UTexas Computation Center, Austin, Texas 78712
ARPA:  jjhnsn@ngp.cc.utexas.edu     jjhnsn@ut-ngp.arpa
UUCP:  ihnp4!ut-ngp!jjhnsn  allegra!ut-ngp!jjhnsn  gatech!ut-ngp!jjhnsn
       seismo!ut-sally!jjhnsn  harvard!ut-sally!jjhnsn

greg@utcsri.UUCP (Gregory Smith) (04/20/86)

In article <1104@psivax.UUCP> friesen@psivax.UUCP (Stanley Friesen) writes:
>	Could you expand on this! Do you mean that if you cast a
>pointer to a long to a pointer to a short and dereference you will get
>the *high* order portion on a big-endian machine and the *low* order
>portion on a little-endian? Clearly a portability problem, and the
>big-endian behavior is counter-intuitive. Or do you intend that the
>pointer always point to the low-order byte even on a big-endian
>machine? Then you have to index *backwards* to break the item up into
>bytes! Really, the only way to get the right semantics on a big-endian
>machine is to actually convert pointers when a cast is used.

I strongly disagree. If you have long *x, then (char)*x ( as opposed to
*(char*)x ) is the low-order byte of the pointed-to long and is
portable. If you also have char c, then c= *x is also portable. One
would hope that these would be recognized by the code generator on a
big-endian machine, so that only the single byte would be read.
Besides, your solution would get really weird on things like the PDP
where lo bytes come first in words but lo words are second in longs (
admittedly a silly setup ).

Pointer conversions between 'pointer-to-x' and 'pointer-to-y' should be
no-ops whenever possible, since things like (struct foo *)malloc(...)
are done so often. If the above scheme were to be used, then promoting,
say, the (char *) from malloc to a (long *) would require masking off
the lower 2 ( or whatever ) bits of the address. On a 68K, e.g., the
lower bit would have to be cleared when a (char *) was promoted, and
there is no provision in the instruction set to do this directly to an
address register, which is where a pointer would most likely be. Of
course, on some machines pointers have 'tag' bits which have to be
modified so the new pointer can be dereferenced properly - so a no-op
pointer conversion can't be done regardless of big/little-end.

The following portably strips a long into bytes:

int i;
unsigned char bytes[ sizeof( long )];		/* lo-byte first */
long input;
	for(i=0;i<sizeof(long); ++i)
		bytes[i] = input >> (i<<3);

Nit-pickers can substitute i*CHARBITS for i<<3. Sure it's slow, but it
is portable, and the problem is one that tends to defy portability more
than others. Besides, it could be sped up easily at the expense of
clarity ( that's what clarity is for :-) ). If you really want speed,
put the following in your 'config.h':

On 68K:			On Vax:			On PDP11:
union wombat{		union wombat{		union wombat{
    long longish;	    long longish;	    long longish;
    struct chs{		    struct chs{		    struct chs{
	char ch3,ch2;		char ch0,ch1;		char ch2,ch3;
	char ch1,ch0;		char ch2,ch3;		char ch0,ch1;
    }charish;		    }charish;		    }charish;
};			};			};

If you declare 'union wombat x' then x.longish is a long
and x.charish.ch3 is the high-order byte, and is fast. Given
long *lp, you could do ((struct chs*)lp)->ch2 which would portably
refer to the second-most-significant byte. Not as general, but it *is*
fast.

In summary, I don't feel that the portability problem you talk about
will rear its nasty head very often, and it can be dealt with using the
tools provided, just as other portability problems can be. The trouble
you have to go to depends on (1) how portable you want it to be (2) how
fast you want it to be (3) how many intrinsically hard-to-port things
you want to do. Stripping specific bytes out of a long by addressing
tricks *is* intrinsically non-portable and is not a very common thing,
anyway.
-- 
"If you aren't making any mistakes, you aren't doing anything".
----------------------------------------------------------------------
Greg Smith     University of Toronto      UUCP: ..utzoo!utcsri!greg

guido@boring.UUCP (04/20/86)

In article <1104@psivax.UUCP> friesen@psivax.UUCP (Stanley Friesen) writes:
>	Could you expand on this! Do you mean that if you cast a
>pointer to a long to a pointer to a short and dereference you will get
>the *high* order portion on a big-endian machine and the *low* order
>portion on a little-endian? Clearly a portability problem, and the
>big-endian behavior is counter-intuitive.

You're right.  Casting "pointer to long" to "pointer to short" in C has
a portability problem.  This can't be solved by outlawing big-endian
architectures, nor by 'fixing' the C compilers.  Non-portable casts
have an effect that is obvious or at least understandable when you know
the underlying hardware architecture of a given implementation; never
mind intuitivity, and you correctly describe the effect of this
particular example.

If you want to write portable C, don't use non-portable constructs.
If you want to flame big-endianness, don't use C portability as an
argument.

You're right about the counter-intuitivity of the effect of
dereferencing a wrong-size pointer in assembly language, and (as has
been said numerous times by now) this was one of the reasons for its
use in the PDP-11.

-- 
	Guido van Rossum, CWI, Amsterdam <guido@mcvax.UUCP>

crowl@rochester.ARPA (Lawrence Crowl) (04/21/86)

In article <2568@utcsri.UUCP> greg@utcsri.UUCP (Gregory Smith) writes:
>In article <17162@rochester.ARPA> crowl@rochester.UUCP (Lawrence Crowl) writes:
>>CHALLENGE:  Come up with a scheme for representing numbers, and a sorting
>>scheme in which numbers sort naturally.  Your scheme must deal with variable
>>length character strings and variable size numbers.  That is, you cannot
>>require strings to be padded with nulls, or numbers to be padded with zeros.
>How about prepending the digit count to big-endian digit strings?
>so 32 and 126 become 232 and 3126, and a lexical comparison gives
>3126 > 232. Of course, leading zeroes in the significand cannot be used. ...
The problem with this scheme is that it assumes a constant size "length" field.
This is the best that I have yet seen though.
 
>>Note that many part numbering schemes have intermixed letters and digits.
>But what significance do letters have? do you want them to be ignored,
>or to be assigned a dictionary order with the numbers?
This is part of the problem!

>P.S. this rather obscure scenario hardly seems reason to state that
>little- and big-endians are both wrong.

Sort the following list of "words": 302, 3B2, -42, 132, 26, AD, ab, 74LS00

--The damn poster program wants more lines.
--It says "more included lines than new next".
--I've said all I want to say.
--You want more lines?
--Take this!
--Feed it to your line eater!
--Burp!
-- 

Lawrence Crowl             716-275-5766        University of Rochester
                                               Computer Science Department
...!{allegra,decvax,seismo}!rochester!crowl    Rochester, New York,  14627

friesen@psivax.UUCP (04/21/86)

In article <2590@utcsri.UUCP> greg@utcsri.UUCP (Gregory Smith) writes:
>
>I strongly disagree. If you have long *x, then (char)*x ( as opposed to
>*(char*)x ) is the low-order byte of the pointed-to long and is
>portable.

	What I was trying to say is that *both* should be portable and
equivalent. Actually it is when going the other way, from a pointer to
a small integer to a pointer to a large integer that you *have* to use
the *(long *)x form, otherwise you will not get the whole thing.

>Besides, your solution would get really weird on things like the PDP
>where lo bytes come first in words but lo words are second in longs (
>admittedly a silly setup ).

	What solution? I was only talking about the *semantics* of a
cast. I was saying that pointer conversions should result in pointers
to rational objects, not weird things like the high order fragment
of a larger entity. This is a statement about desired behavior *not*
implementation.
>
>Pointer conversions between 'pointer-to-x' and 'pointer-to-y' should be
>no-ops whenever possible, since things like (struct foo *)malloc(...)
>are done so often. If the above scheme were to be used, then promoting,
>say, the (char *) from malloc to a (long *) would require masking off
>the lower 2 ( or whatever ) bits of the address.

	I agree, pointer conversions *should* be no-ops. The article I
was responding to had stated that big-endian and little-endian
pointer conversions could *both* be no-ops and I was asking HOW. As
far as I can see, in order to get rational, intuitive *semantics* on a
big-endian machine pointer conversions *cannot* be no-ops. That is all
I was saying. So again, on a big-endian machine, how do you get the
desired *semantics* and still leave pointer conversions as no-ops?
>
>The following portably strips a long into bytes:
>
>int i;
>unsigned char bytes[ sizeof( long )];		/* lo-byte first */
>long input;
>	for(i=0;i<sizeof(long); ++i)
>		bytes[i] = input >> (i<<3);
>
	And I am saying that the following *should* be portable, and
that any implementation that it doesn't work on is brain-damaged.

	register int i;
	unsigned char bytes[ sizeof( long )];		/* lo-byte first */
	long input;
	register char *cvptr;

	for(cvptr = (char *)&input, i = 0; i < sizeof(long); i++)
		bytes[i] = cvptr[i];

-- 

				Sarima (Stanley Friesen)

UUCP: {ttidca|ihnp4|sdcrdcf|quad1|nrcvax|bellcore|logico}!psivax!friesen
ARPA: ttidca!psivax!friesen@rand-unix.arpa

franka@mmintl.UUCP (Frank Adams) (04/22/86)

In article <17312@rochester.ARPA> crowl@rochester.UUCP (Lawrence Crowl) writes:
>>I wrote:
>>>CHALLENGE:  Come up with a scheme for representing numbers, and a sorting
>>>scheme in which numbers sort naturally.  Your scheme must deal with variable
>>>length character strings and variable size numbers.  That is, you cannot
>>>require strings to be padded with nulls, or numbers to be padded with zeros.
>
>Arch D. Robison wrote:
>>...       -4 = AAAAZ   -3 = AAAZ   -2 = AAZ   -1 = AZ 
>>  0 = M    1 = Z        2 = ZZ      3 = ZZZ    4 = ZZZZ ...
>>
>
>An interesting solution.  I totally failed to consider unary notations.  Just
>to keep the answers honest, I will add the requirement that some efficient
>form of representation be used.  Radix notation is efficient, but you are not
>required to use it.  For extra credit, extend your notation to deal with
>non-integral values.  This is easy under current notation if the integral
>part is the same.

Basically, you use a unary notation (pure or modified) to indicate how long
the number is, and the number follows.  Using the scheme above, 257 would
be ZZZ257.  One and a half would be Z15.  This is essentially floating point
notation.
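
[A minimal sketch of the positive-integer case of this scheme -- my own
illustration, not Adams's code; encode() and its buffer handling are
assumptions:]

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Encode a positive integer as a unary length prefix of 'Z's followed
 * by its decimal digits, e.g. 257 -> "ZZZ257".  Since 'Z' compares
 * greater than every digit in ASCII, a longer number always sorts
 * after a shorter one, so strcmp() order agrees with numeric order.
 * The caller must supply a buffer big enough for 2*digits+1 bytes. */
void encode(unsigned n, char out[])
{
	char digits[16];
	int len, i;

	len = sprintf(digits, "%u", n);
	for (i = 0; i < len; i++)
		out[i] = 'Z';
	strcpy(out + len, digits);
}
```

[So encode(99, a) gives "ZZ99" and encode(100, b) gives "ZZZ100", and
strcmp(a, b) < 0, matching 99 < 100.]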

This has to be modified a bit to deal with negative numbers.  One possibility
is to allocate separate "exponent" spaces for positive and negative numbers.
E.g., for positive numbers, the exponents are:

...     -4 = NNNNT   -3 = NNNT   -2 = NNT   -1 = NT
0 = T    1 = ZT       2 = ZZT     3 = ZZZT   4 = ZZZZT

For negative numbers, the exponents are:

...     -4 = MMMMG   -3 = MMMG   -2 = MMG   -1 = MG
0 = G    1 = AG       2 = AAG     3 = AAAG   4 = AAAAG

Also, for negative numbers, the digits should be in (9's) complement form.
Thus -1 would be AG8.  A special notation, e.g. NA, is required for zero.

This can of course be done somewhat better.  A higher radix can be used, and
the exponent encoding can be further optimized.  But the point is to describe
the approach, not to work out all the details.

Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
Multimate International    52 Oakland Ave North    E. Hartford, CT 06108

g-rh@cca.UUCP (Richard Harter) (04/24/86)

>In article <17162@rochester.ARPA> crowl@rochester.UUCP (Lawrence Crowl) writes:
>
>CHALLENGE:  Come up with a scheme for representing numbers, and a sorting
>scheme in which numbers sort naturally.  Your scheme must deal with variable
>length character strings and variable size numbers.  That is, you cannot
>require strings to be padded with nulls, or numbers to be padded with zeros.

Presented without endorsement:

Addressing nomenclature -- big endian (i.e. msb is bit 0, etc.)

Item is divided into five parts:

Part I - sign bit (0 is negative, 1 is positive)

Part II - Size of field size in unary, i.e. string of 1's terminated by 0

Part III - Field size, in binary (Field size is size of field to left of
           implicit decimal point.)

Part IV -  Portion of string to left of implied decimal point, encoded in
	   bytes.

Part V  - Portion of string to right of implied decimal point, encoded in
          ternary, using two bits per ternary digit, as follows:

	00   Sentinel, terminating right field
	01   ternary 0
	10   ternary 1
	11   ternary 2

All numbers are represented as ASCII characters.

The point of the ternary representation (or something equivalent) is
that it provides for a sentinel for variable length strings that sorts
lexicographically.

If an integer is to be considered as a number and compared in that fashion
it goes in the left field -- if it is to be considered as a character string
it goes in the right field.
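
[The sentinel property of Part V can be sketched as follows -- my own toy
version, not Harter's: one byte per ternary digit rather than two bits, to
keep the code short:]

```c
#include <assert.h>
#include <string.h>

/* Encode a string for the right-hand field: each character becomes six
 * base-3 digits (3^6 = 729 covers any byte value), each digit stored
 * as the byte d+1, and a 0 byte terminates the field.  The sentinel
 * compares below every digit code, so a prefix string sorts ahead of
 * any extension of it, and strcmp() on the encodings matches
 * lexicographic order on the originals.  The output buffer must hold
 * 6*strlen(s)+1 bytes. */
void encode_right(const char *s, char out[])
{
	static const int pow3[6] = { 243, 81, 27, 9, 3, 1 };
	int i;

	while (*s) {
		unsigned c = (unsigned char)*s++;

		for (i = 0; i < 6; i++)
			*out++ = (char)(c / pow3[i] % 3 + 1);
	}
	*out = '\0';	/* the sentinel doubles as the terminator */
}
```

[With this, the encoding of "AB" compares less than that of "ABC", which is
exactly what null-padding was needed for before.]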

	I believe that this representation meets the challenge.
I sincerely hope that I never have to work with it.

	Richard Harter, SMDS Inc.

rose@think (04/26/86)

In article <1117@psivax.UUCP> friesen@psivax.UUCP (Stanley Friesen) writes:
>In article <2590@utcsri.UUCP> greg@utcsri.UUCP (Gregory Smith) writes:
>>
>>I strongly disagree. If you have long *x, then (char)*x ( as opposed to
>>*(char*)x ) is the low-order byte of the pointed-to long and is
>>portable.
>
>	What I was trying to say is that *both* should be portable and
>equivalent.
>.....
>	And I am saying that the following *should* be portable, and
>that any implementation that it doesn't work on is brain-damaged.
>
>	register int i;
>	unsigned char bytes[ sizeof( long )];		/* lo-byte first */
>	long input;
>	register char *cvptr;
>
>	for(cvptr = (char *)&input, i = 0; i < sizeof(long); i++)
>		bytes[i] = cvptr[i];
>
>				Sarima (Stanley Friesen)

Well, this is certainly false for all existing C's on big-endian machines.
Even if you make the pointer "(char *)&input" point to the last byte of
the datum, when you index on it, say by referencing "cvptr[1]", you get
bytes *past* the last byte of the long datum.
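
[A small sketch -- mine, not Rose's -- of the disagreement he is pointing
at: compare the shift view of a long with the memory view.]

```c
#include <assert.h>

/* Return 1 if indexing through a char pointer yields the same
 * low-byte-first sequence as shifting, i.e. if the machine stores the
 * low-order byte at the lowest address; return 0 otherwise.  On a
 * big-endian machine the two views disagree, which is why the cvptr
 * loop cannot be portable as written. */
int views_agree(long input)
{
	unsigned char *cvptr = (unsigned char *)&input;
	int i;

	for (i = 0; i < (int)sizeof(long); i++)
		if (cvptr[i] != (unsigned char)(input >> (i * 8)))
			return 0;
	return 1;
}
```

[views_agree(0x01020304L) returns 1 on a little-endian machine and 0 on a
big-endian one; only for values like 0L, whose bytes are all equal, do the
two views coincide everywhere.]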

But the funny thing is, a C compiler *could* be designed to make Sarima's
code work on a big-endian machine.  Pointer ordering and arithmetic could
be defined, in a self-consistent way, *backwards* from the machine-word
ordering.  Arrays and structs would be allocated backward in memory.
(As a bonus, conversion between pointers and ints could involve a negation,
but portable C makes no such requirements.)  In essence, the abstract
addressing structure imposed by this hypothetical compiler would turn
our misguided big-endian machine into a virtual little-endian.

This hack fails for the waverers, such as PDP-11's.  (However, you
could complement the 1s bit before and after index arithmetic?)
Be wrong, but be consistent!

Disclaimers:  This was not a serious suggestion to the compiler designers
of the world.  It certainly has nothing to do with the official policy
of Thinking Machines Corporation.  :-) :-) :-)

[P.S. Sarima, if this was your meaning all along, sorry to steal your
      thunder; I must have missed an article.]

[P.P.S. Local work crunch precludes retro-flame on &array responses.
	But stay tuned, netters.]
-- 
----------------------------------------------------------
John R. Rose		     Thinking Machines Corporation
245 First St., Cambridge, MA  02142    (617) 876-1111 X270
rose@think.arpa				  ihnp4!think!rose

crowl@rochester (04/26/86)

In article <1298@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
]In article <17312@rochester.ARPA> crowl@rochester.UUCP (Lawrence Crowl) writes:
]]CHALLENGE:  Come up with a scheme for representing numbers, and a sorting
]]scheme in which numbers sort naturally.  Your scheme must deal with variable
]]length character strings and variable size numbers.  That is, you cannot
]]require strings to be padded with nulls, or numbers to be padded with zeros.
]
]Basically, you use a unary notation (pure or modified) to indicate how long
]the number is, and the number follows.  Using the scheme above, 257 would
]be ZZZ257.  One and a half would be Z15.  This is essentially floating point
]notation. ...  This has to be modified a bit to deal with negative numbers. ...
]Also, for negative numbers, the digits should be in (9's) complement form. ...
]This can of course be done somewhat better.  A higher radix can be used, and
]the exponent encoding can be further optimized.  But the point is to describe
]the approach, not to work out all the details.

This seems good to me.  If there are no radical alternatives, let's close
the issue.
-- 

Lawrence Crowl             716-275-5766        University of Rochester
                                               Computer Science Department
...!{allegra,decvax,seismo}!rochester!crowl    Rochester, New York,  14627

greg@utcsri (04/26/86)

In article <1117@psivax.UUCP> friesen@psivax.UUCP (Stanley Friesen) writes:
>In article <2590@utcsri.UUCP> greg@utcsri.UUCP (Gregory Smith) writes:
>>...
>	And I am saying that the following *should* be portable, and
>that any implementation that it doesn't work on is brain-damaged.
>
>	register int i;
>	unsigned char bytes[ sizeof( long )];		/* lo-byte first */
>	long input;
>	register char *cvptr;
>
>	for(cvptr = (char *)&input, i = 0; i < sizeof(long); i++)
>		bytes[i] = cvptr[i];
>
Sorry - I thought you were suggesting that big-endian pointer conversions
should set the lower bits (when pointing to a smaller object) and clear
them (when pointing to a larger one). What you are actually saying is
that any implementation which is not strictly little-endian is 'brain-
damaged' (Including (a) PDP11 _longs_, (b) 68000). I will agree that (a)
is brain-damaged, but I think the adjective is a little strong for (b).
I am a little-end fan myself, but I can face reality, and the reality is
there will be big-endian machines around that are still worth using, and
that the 'problem' can be dealt with so easily that one needn't damn
the 68K to eternal flames on this one point.

-- 
"For every action there is an equal and opposite malfunction"
----------------------------------------------------------------------
Greg Smith     University of Toronto      UUCP: ..utzoo!utcsri!greg

rbj@icst-cmr (04/30/86)

> In article <1117@psivax.UUCP> friesen@psivax.UUCP (Stanley Friesen) writes:
> >In article <2590@utcsri.UUCP> greg@utcsri.UUCP (Gregory Smith) writes:
> >>...
> >	And I am saying that the following *should* be portable, and
> >that any implementation that it doesn't work on is brain-damaged.
> >
> >	register int i;
> >	unsigned char bytes[ sizeof( long )];		/* lo-byte first */
> >	long input;
> >	register char *cvptr;
> >
> >	for(cvptr = (char *)&input, i = 0; i < sizeof(long); i++)
> >		bytes[i] = cvptr[i];
> >
> Sorry - I thought you were suggesting that big-endian pointer conversions
> should set the lower bits (when pointing to a smaller object) and clear
> them (when pointing to a larger one). What you are actually saying is
> that any implementation which is not strictly little-endian is 'brain-
> damaged' (Including (a) PDP11 _longs_, (b) 68000). I will agree that (a)
> is brain-damaged, but I think the adjective is a little strong for (b).
> I am a little-end fan myself, but I can face reality, and the reality is
> there will be big-endian machines around that are still worth using, and
> that the 'problem' can be dealt with so easily that one needn't damn
> the 68K to eternal flames on this one point.
> 
> "For every action there is an equal and opposite malfunction"
> ----------------------------------------------------------------------
> Greg Smith     University of Toronto      UUCP: ..utzoo!utcsri!greg

To all you Big Endian fans out there: What's your problem? You guys
are the ones saying that the first byte is the most significant, so
why are y'all bitchin that you don't get the low byte like on *sane*
machines? Axually, I like the 680[012]0 also. Byte order doesn't really
bother me either way except as an academic issue.

I don't see the need or the possibility for your example to be both
portable and work the way you described. The two are at odds.
Remember, C does what you tell it to do, not what you think it should do.

	(Root Boy) Jim Cottrell		<rbj@cmr>
	"One man gathers what another man spills"

jsdy@hadron.UUCP (Joseph S. D. Yao) (05/08/86)

In article <5056@think.ARPA> rose@think.UUCP (John Rose) writes:
>In article <1117@psivax.UUCP> friesen@psivax.UUCP (Stanley Friesen) writes:
>>	And I am saying that the following *should* be portable, and
>>that any implementation that it doesn't work on is brain-damaged.
>>	register int i;
>>	unsigned char bytes[ sizeof( long )];		/* lo-byte first */
>>	long input;
>>	register char *cvptr;
>>	for(cvptr = (char *)&input, i = 0; i < sizeof(long); i++)
>>		bytes[i] = cvptr[i];
>Well, this is certainly false for all existing C's on big-endian machines.
>Even if you make the pointer "(char *)&input" point to the last byte of
>the datum, when you index on it, say by referencing "cvptr[1]", you get
>bytes *past* the last byte of the long datum.

I don't see rose's point.  Attend:
	0:	0123		3210
		big-endian	little-endian
In both cases, if &input == 0, and (cvptr = (char *) &input) == 0,
then cvptr[0-3] will map in the same order to bytes[0-3]; and the
copy succeeds.  Perhaps he is thinking that (char *) &input == 3,
and (short int *) &input == 2?  This is exactly the behaviour that
big-endians do  n o t  do, and which little-endian PDP-11 and VAX-11
programmers despair of because they can't perform type punning.

Incidentally, don't call an 11 a PDP, because there are radically
different architectures on the PDP-1,-4,-5,-6,-7,-8,-9,-10,-12,-15,
-16,-20,...	;-)
-- 

	Joe Yao		hadron!jsdy@seismo.{CSS.GOV,ARPA,UUCP}