[comp.std.internat] Endianism in code sets

eliot@chutney.rtp.dg.com (Topher Eliot) (05/08/91)

There has been some mention of the "problem" of endianism in code sets, in
particular in Unicode.  This seems to me to be completely avoidable, if one
only gives it a little thought.

Endianism problems arise when one is dealing with an item of data that is
larger than the smallest addressable item (e.g. a 16-bit quantity on a
machine with addressable bytes), and one is moreover compelled to decide both:
1) Which end comes "first" (e.g. for transmittal over a serial communication
channel)
2) Which end has greater significance (e.g. when using the bits as an integer).
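
To make the two decisions concrete, here is a minimal Python sketch.  It
assumes we do want to view a 16-bit quantity as an integer, and shows that
the same integer can be sent with either end first.  The variable names are
mine, chosen only for illustration.

```python
import struct

value = 0x0A59  # a 16-bit quantity, viewed here as an integer (decision 2)

# Decision 1: which end is transmitted first.
big_endian_bytes = struct.pack(">H", value)     # most significant byte first
little_endian_bytes = struct.pack("<H", value)  # least significant byte first

print(big_endian_bytes.hex())     # 0a59
print(little_endian_bytes.hex())  # 590a
```

The integer is the same in both cases; only the order of the bytes on the
wire differs.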

With code sets, we are not required to do (2).  Just because something is 16
bits long doesn't mean that it is a 16-bit integer.  We no more have to
agree on which byte is the more significant than we have to agree on
which bits are the mantissa and which are the exponent.  We do have
to keep straight which byte comes _first_, but I can't see any problem in
that.

We are very used to thinking that, for example, "0A59" represents an integral
value, with 00001010 in the higher significance bits, and 01011001 in the
low order bits.  We just need to learn to think of (and process) "U+0A59" as
representing the character with 00001010 in the _first_ byte, etc.
This may represent a minor inconvenience for little-endian systems, where
a text string representation like "\x0A59" would have to be parsed differently
from the integer value 0x0A59.  I say "minor" because I'm not doing
the work :-)
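
A rough sketch of what I mean, in Python.  The function code_to_bytes and
the strict "U+XXXX" input format are my own assumptions for illustration,
not anything taken from the Unicode standard.  The point is that the written
notation maps to an ordered pair of bytes, with no integer involved.

```python
def code_to_bytes(notation):
    """Turn a 'U+XXXX' string into two bytes, in the order written.

    Hypothetical helper: the first two hex digits become the first byte,
    the last two become the second byte.
    """
    if len(notation) != 6 or not notation.startswith("U+"):
        raise ValueError("expected a string of the form U+XXXX")
    return bytes.fromhex(notation[2:])


pair = code_to_bytes("U+0A59")
first_byte, second_byte = pair[0], pair[1]
print(hex(first_byte), hex(second_byte))  # 0xa 0x59

# Only if someone insists on an integer does byte order matter again:
as_big = int.from_bytes(pair, "big")        # 0x0A59
as_little = int.from_bytes(pair, "little")  # 0x590A
```

The last two lines show the inconvenience: a little-endian integer load of
the same two bytes yields 0x590A rather than 0x0A59.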

Of course, some implementations may choose to _treat_ a character as an
integer, for indexing into a table or whatever.  Such implementations probably
will not be portable to other architectures that are differently-ended, without
appropriate provisions for the reversal.  Also, I've never seen the Unicode
standard, and maybe the authors there did something foolish that committed
them to treating one byte of each character as being more significant than
the other.
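
For the table-indexing case, here is a hedged Python sketch of the
difference between a portable and a non-portable implementation.  Both
function names are hypothetical, and treating the first byte as the high
half of the index is simply a convention I picked for the example.

```python
import struct
import sys


def table_index(char_bytes):
    """Portable: the byte order is stated explicitly (first byte is high)."""
    return (char_bytes[0] << 8) | char_bytes[1]


def native_index(char_bytes):
    """Not portable: reinterprets the bytes as a native 16-bit integer."""
    return struct.unpack("=H", char_bytes)[0]


char_bytes = bytes([0x0A, 0x59])

print(hex(table_index(char_bytes)))   # 0xa59 on every machine
print(hex(native_index(char_bytes)))  # depends on sys.byteorder
```

The first function gives the same index everywhere; the second gives 0x0A59
on a big-endian host and 0x590A on a little-endian one, which is exactly the
reversal an implementation would have to provide for.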

Well, I managed to stir things up with my posting about message numbers.
Is this one controversial, too?

-- 
Topher Eliot                           Data General DG/UX Internationalization
(919) 248-6371        62 T. W. Alexander Dr., Research Triangle Park, NC 27709
eliot@dg-rtp.dg.com                           {backbone}!mcnc!rti!dg-rtp!eliot
Obviously, I speak for myself, not for DG.