[bit.listserv.sas-l] ASCII versus EBCDIC Collating Sequence

HART@APLVM.BITNET (Edwin Hart) (02/02/90)

Bob Kleckner raised the issue of ASCII versus EBCDIC collating sequences.
This assumes that sorting is based on the binary value (code point) of a
character.  What Bob says is true:

  ASCII  0...9,  A...Z,  a...z
  EBCDIC a...z,  A...Z,  0...9
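
The two orderings above can be reproduced with a minimal sketch like the one
below.  It assumes Python's "cp037" codec is an acceptable stand-in for U.S.
EBCDIC; the helper name sort_by_code is made up for this illustration.

```python
# Sort the same characters by their code points in two encodings.
# Assumption: "cp037" (an EBCDIC code page shipped with Python) represents
# U.S. EBCDIC well enough for this comparison.
chars = list("aA0zZ9")

def sort_by_code(items, encoding):
    """Order strings by the raw bytes they have in the given encoding."""
    return sorted(items, key=lambda s: s.encode(encoding))

ascii_order = "".join(sort_by_code(chars, "ascii"))
ebcdic_order = "".join(sort_by_code(chars, "cp037"))

print("ASCII :", ascii_order)   # digits, then upper case, then lower case
print("EBCDIC:", ebcdic_order)  # lower case, then upper case, then digits
```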

However, the issue is much deeper than that.

  To begin, let's start with English.  What were you taught in
elementary school?  Well that is different from both ASCII and EBCDIC:

  AaBbCc...Zz

and you did not worry about the numbers.  Frequently the numbers were converted
from digits to words, for example, "300" becomes "three hundred" and the
sorting is done on the "three".  This is a little complicated when sorting
for a data base.  Also what about the sorting exception where the "Mc" is
expanded to sort as "Mac"?  The point is that CORRECT sorting is independent
of the code used to store the characters.  Using either ASCII or EBCDIC codes
to define the sorting sequence results in incorrect English sorts.
Notice that I used the words "English sort".  That is to bring up
another issue:  Correct sorting order depends on the culture.  However, before
addressing non-English sorts, we need some background on codes for non-English
accented characters.
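
The English conventions described above (upper and lower case interleaved,
and "Mc" filed as "Mac") can be sketched as a sort key.  This is only an
illustration under stated assumptions: the function english_key is
hypothetical, it compares without regard to case first and lets upper case
win ties, and it ignores the digits-to-words rule entirely.

```python
def english_key(name):
    """Hypothetical sort key: ignore case first, break ties with upper case
    first, and treat a leading "Mc" as if it were spelled "Mac"."""
    expanded = "Mac" + name[2:] if name.startswith("Mc") else name
    # False (upper case) sorts before True (lower case) when words tie.
    case_flags = [ch.islower() for ch in expanded]
    return (expanded.lower(), case_flags)

words = ["banana", "Apple", "apple", "Banana"]
names = ["Madison", "McAdam", "MacDonald", "Mabry"]

print(sorted(words))                   # code-point order: capitals first
print(sorted(words, key=english_key))  # Apple, apple, Banana, banana
print(sorted(names))                   # code-point order: McAdam comes last
print(sorted(names, key=english_key))  # McAdam is filed with the "Mac" names
```

The key is independent of how the characters are stored, which is the point:
the ordering rule lives in the key function, not in the code points.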

  The International Organization for Standardization (ISO) has defined 8-bit
codes (that expand 7-bit ASCII).  The important one for the U.S. is ISO 8859-1.
One of the nice things about ISO 8859-1 is that U.S. ASCII forms the left half.
The character set for this code contains the characters required in Western
European Languages.  Forty-four countries are covered by the standard.  The
languages include:  English, French, German, Spanish, Portuguese, Italian,
Swedish, Norwegian, Danish, Flemish, Icelandic.  The standard covers Western
Europe; North, South, and Central America; Australia and New Zealand.  The
ISO 8859-1 character set includes lots of accented characters and while these
have code points ordered relative to each other, they are not interspersed
with the unaccented characters.  In addition, IBM has expanded the country-
dependent EBCDIC codes, like 95-character U.S. EBCDIC, to contain the full
set of characters (191) found in ISO 8859-1.  IBM calls these "Country
Extended Code Pages" or CECPs.  In the CECPs, the accented characters are all
over the place.  Therefore, internationally, the binary code point of a
character has absolutely no relationship to the correct sorting order in any
language and culture.
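
A small sketch makes the point about accented characters concrete.  It
assumes Python's "latin-1" codec for ISO 8859-1 and, as a stand-in for one
of the CECPs, Python's "cp037" codec; the word list is invented for the
example.

```python
# Sort three words by their byte values in two 8-bit codes.
# Assumptions: "latin-1" is ISO 8859-1; "cp037" stands in for a CECP.
words = ["zebra", "école", "eagle"]

latin1_order = sorted(words, key=lambda w: w.encode("latin-1"))
cecp_order = sorted(words, key=lambda w: w.encode("cp037"))

print(latin1_order)  # the accented word lands after "zebra"
print(cecp_order)    # the accented word lands before "eagle"
```

Neither result puts the accented word between "eagle" and "zebra", where a
dictionary would file it.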

  Now we are prepared to talk about correct sorting in different countries
and with different languages.  In the previous paragraph, we said that the
binary values of characters in these international codes have no relationship
to correct sorting.  This is one problem.  You might think that the binary
values might be changed to order the characters so that they will sort
properly.  It certainly has a nice logical feel.  Unfortunately, that brings
up the second problem:  Character sorting depends on language and culture.
Some examples will help:

  Spanish considers "ch", n with a tilde, and "ll" as separate characters with
    particular sorting sequences.  ("ch" sorts, I believe, after "cz".)
  French only considers the accents for sorting if all of the characters in
    the word are the same (homographs) when the accents are removed.  Then
    sorting is done on the accented characters starting with the accent(s) at
    the right end of the word.
  Scandinavian languages use a with a circle as equivalent to "aa", but
    A with a circle sorts after "z".
  The German sharp s character looks like a Greek lower case Beta, but sorts
    as "ss" in German and by the way, the upper-case is "SS" (two "s").
    Obviously, converting a lower-case character to upper-case cannot be done
    by simply adding a constant value to a character or ORing a bit.
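
The remark about upper-casing the sharp s can be checked with a short
sketch.  The function ascii_bit_upper is a hypothetical helper that does
what many programs of the "flip one bit" school do; Python's built-in
str.upper() is used for comparison.

```python
def ascii_bit_upper(text):
    """Naive upper-casing: clear bit 0x20, but only for the letters a-z."""
    return "".join(
        chr(ord(c) & ~0x20) if "a" <= c <= "z" else c
        for c in text
    )

word = "straße"
naive = ascii_bit_upper(word)
proper = word.upper()

print(naive)   # the sharp s is left as it was
print(proper)  # the sharp s becomes two letters, so the word grows
print(len(word), len(proper))
```

Because the result is longer than the input, no per-character arithmetic on
code points can produce it.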

These are just some examples that I know about.  More exceptions exist, I'm sure.
This should give you a feel for the wide scope of the sorting problem.
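
Two of the rules listed above, the Spanish digraphs and the French
right-to-left accent comparison, can be sketched as sort keys.  Both
functions are hypothetical illustrations, not a complete collation: the
Spanish key follows the traditional alphabet with "ch" after "c" and "ll"
after "l" and sends any unknown character to the end, and the French key
only handles the accent rule.

```python
import unicodedata

SPANISH_ALPHABET = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j",
                    "k", "l", "ll", "m", "n", "ñ", "o", "p", "q", "r", "s",
                    "t", "u", "v", "w", "x", "y", "z"]
RANK = {letter: i for i, letter in enumerate(SPANISH_ALPHABET)}

def spanish_key(word):
    """Hypothetical key: treat "ch" and "ll" as single letters."""
    word = word.lower()
    key, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in ("ch", "ll"):
            key.append(RANK[pair])
            i += 2
        else:
            # Characters outside the alphabet above sort last (assumption).
            key.append(RANK.get(word[i], len(RANK)))
            i += 1
    return key

def french_key(word):
    """Hypothetical key: base letters first, then accents from the right."""
    base, accents = [], []
    for ch in unicodedata.normalize("NFD", word.lower()):
        if unicodedata.combining(ch) and accents:
            accents[-1] += ch   # attach the mark to the preceding letter
        else:
            base.append(ch)
            accents.append("")
    return ("".join(base), accents[::-1])

spanish = sorted(["chico", "cuna", "dama", "llama", "luz"], key=spanish_key)
french = sorted(["côté", "coté", "côte", "cote"], key=french_key)
print(spanish)
print(french)
```

Note that the two keys are incompatible with each other, which is why a
single reordering of the code points cannot serve every language.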

  Are all of these foreign characters important to people living in the U.S.?
For the most part, we are not interested.  However, with the 1992 European
unification, and the emphasis on reducing the trade deficit by exporting
more, we need to be much more sensitive to other people's cultures.  Even
without the commercial aspects, it is simple courtesy.

  I work for a Laboratory of the Johns Hopkins University.  Notice it is "Johns"
not "John".  It has been rumored that the first cut in the admission process
eliminates those who can't correctly spell the name of the University/Medical
School.  If true, that is a very high price to pay for mis-spelling.  In many
"foreign" languages, if you leave off the accents, you mis-spell a person's
name.  What happens to junk mail that does not even spell YOUR name correctly?
Do you even bother to open the envelope?

  I had an experience where putting the accents on a man's name (that is,
spelling it correctly) made the difference between immediate rejection and
making a good impression, which let us work together successfully to a
mutually beneficial conclusion.


Edwin Hart
Johns Hopkins University Applied Physics Laboratory
Chairman of the SHARE Task Force that wrote the SHARE Position Paper
"ASCII/EBCDIC Character Set and Code Issues in SAA"

You might also look at ISO8859 @ JHUVM for more insight into some of the
character set and code issues.