[bit.listserv.sas-l] Internal collating sequences and character comparisons.

UEPRCK@UNC.BITNET (Bob Kleckner) (02/01/90)

  This is an expansion on Sally Muller's 01/30/90 comments on PROC SORT
and differences between EBCDIC (IBM) and ASCII (everybody else ?)
internal collating sequences for 'character variable' comparisons.
For those unable to RTFM (p. 1040, V.5 Basics) the smallest to largest
comparison sequence for
    EBCDIC is    a to z  <  A to Z  <  0 to 9    and for
    ASCII  is    0 to 9  <  A to Z  <  a to z    .
As pointed out by Sally, this difference will affect the sort order of
character value BY variables in PROC SORT.
  This difference will also affect character comparisons (p. 224, V.5
Basics) as follows:
    In EBCDIC
               NAME= 'a';
               If NAME < '1';  /* Is true */
    In ASCII
               NAME= 'a';
               If NAME < '1';  /* Is false */ .
  Warning:  Keep this difference in mind if you are moving between
            operating systems.
         :  This may be especially important for those testing SAS
            programs on a PC and then doing production runs on an
            IBM MVS or CMS machine.
  Because someone will ask:  Your IBM PC and PS2 machines use the
                             ASCII internal collating sequence.
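
  The comparison above can be checked on whatever host you are using
with a short DATA step. This is only a minimal sketch: the RANK
function is not part of the original example and is used here just to
display each character's position in the host collating sequence, and
the PUT messages are illustrative.

```sas
data _null_;
  name = 'a';

  /* RANK returns the position of a character in the host   */
  /* collating sequence: 'a' is 129 in EBCDIC and 97 in     */
  /* ASCII; '1' is 241 in EBCDIC and 49 in ASCII.           */
  pos_a = rank(name);
  pos_1 = rank('1');
  put pos_a= pos_1=;

  /* Same test as in the post; the result depends on the host. */
  if name < '1' then put 'a < 1 is true  (EBCDIC ordering)';
  else put 'a < 1 is false (ASCII ordering)';
run;
```

Running the same step on the PC and on the MVS or CMS machine should
show the two different branches being taken.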
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Bob Kleckner  (UEPRCK@UNC.BITNET)
Applications Analyst Programmer
Dept. of Epidemiology, UNC
Chapel Hill, NC 27599-7400
919-966-2080

UPHILG@UNC.BITNET (Philip Gallagher) (02/02/90)

Bob Kleckner <UEPRCK@UNC> pointed out very nicely the catastrophes that
may occur when running a SAS program developed on a machine that uses
the ASCII collating sequence on a machine that uses EBCDIC (or
vice versa).  He gives the example

   "...  the smallest  to largest comparison sequence for
    EBCDIC is    a to z  <  A to Z  <  0 to 9    and for
    ASCII  is    0 to 9  <  A to Z  <  a to z    . "

Since one of my students correctly pointed out my ignorance last
semester, I would like to tell you that the EBCDIC collating sequence
contains what I choose to consider an oddity that makes me want to say
"That EBCDIC collating sequence is even weirder than I realized!".  I
refer to my version 5 Basics manual, p. 1040:
    Under CMS, OS, & VSE a portion of the EBCDIC collating sequence is:
        abcdefghijklmnopq~stuvwxyz{ABCDEFGHI}J
        KLMNOPQR\STUVWXYZ0123456789

    What idiot would have been naive enough to tell a student (without
looking it up) that a tilde (~) would not appear in the middle of the
small letter sequence and that a right brace (}) and a backslash (\)
would not appear in the middle of the capital letter sequence?
Unfortunately, I know such an idiot;  he was very embarrassed when proven
to be wrong.  "I can't believe that ... ."  I suppose I should have
realized it;  I've used the IBM card/folder with the EBCDIC and ASCII
codes on it enough to know about those strange patterns.  Anyway, I
trust you won't get fooled the way my idiot friend did.
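
    One way to see the gaps in the letter sequence for yourself is to
walk the host collating sequence from 'a' to 'z' and print each
position. This is a sketch only, assuming the RANK and BYTE functions
are available in your release; the variable names are made up for
illustration.

```sas
data _null_;
  length ch $ 1;

  /* Visit every position from 'a' through 'z' in the host sequence. */
  /* Under ASCII these are exactly the 26 lower case letters.        */
  /* Under EBCDIC the range also covers positions between the letter */
  /* groups, including the tilde just before 's'.                    */
  do pos = rank('a') to rank('z');
    ch = byte(pos);
    put pos= ch=;
  end;
run;
```

    Under EBCDIC some of the positions printed are not printable
characters, so the log may show blanks or odd symbols for them.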

                                            Phil Gallagher

HIS@NIHCU.BITNET (Howard Schreier) (02/02/90)

> From:         Philip Gallagher <UPHILG@UNC.BITNET>
> >
> >    "...  the smallest  to largest comparison sequence for
> >     EBCDIC is    a to z  <  A to Z  <  0 to 9    and for
> >     ASCII  is    0 to 9  <  A to Z  <  a to z    . "
>
> Since one of my students correctly pointed out my ignorance last
> semester, I would like to tell you that the EBCDIC collating sequence
> contains what I choose to consider an oddity that makes me want to say
> "That EBCDIC collating sequence is even weirder than I realized!".  I
> refer to my version 5 Basics manual, p. 1040:
>     Under CMS, OS, & VSE a portion of the EBCDIC collating sequence is:
>         abcdefghijklmnopq~stuvwxyz{ABCDEFGHI}J
>         KLMNOPQR\STUVWXYZ0123456789

Note: the EBCDIC sequence *does* include the lower case
"r", following the "q" and preceding the tilde.

I see the following implications:

Where sorts are done strictly for internal purposes (such as
MERGEing data sets), there shouldn't be much of a problem.
A data set which has been transported, say from an EBCDIC
environment to an ASCII one, may have to be re-sorted.
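
As a rough sketch of that re-sort, assuming two hypothetical data
sets MOVED (the transported one) and LOOKUP with a common character
BY variable NAME, sorting both on the receiving host before the
MERGE puts them in that host's collating order:

```sas
/* MOVED, LOOKUP, COMBINED and NAME are hypothetical names. */

/* Re-sort the transported data set in the host's own sequence. */
proc sort data=moved;
  by name;
run;

proc sort data=lookup;
  by name;
run;

/* Both inputs are now in the order this host expects. */
data combined;
  merge moved lookup;
  by name;
run;
```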

If a sort is done to alphabetize a list for external
presentation, and the character variables contain a mixture
of upper and lower case, it is a good idea to create one or
more new variables using the UPCASE function and to actually
sort by these. This is true for both ASCII and EBCDIC
environments. It assures correct placement of names with
embedded upper case letters (such as VanDyke).
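
A minimal sketch of the UPCASE approach, assuming a hypothetical
data set NAMES with a character variable NAME (SORTNAME and SORTED
are likewise made-up names):

```sas
/* Add an upper case copy of NAME to use only as the sort key. */
data sorted;
  set names;
  sortname = upcase(name);
run;

/* Sort by the upper case key; NAME keeps its original case */
/* for printing.                                            */
proc sort data=sorted;
  by sortname;
run;

proc print data=sorted;
  var name;
run;
```

Because every key is in a single case, the relative order of upper
and lower case letters in the host sequence no longer matters.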

If you need an alphabet string in a character expression
(for example, to use with the VERIFY function), do not try
to generate it with the COLLATE function. It would work
with ASCII, but not with EBCDIC.
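
A short sketch of the difference, with made-up variable names. The
literal string is the same on every host, while the COLLATE range
depends on the host collating sequence:

```sas
data _null_;
  /* Portable: a literal alphabet, identical on ASCII and EBCDIC. */
  upper = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';

  /* Host dependent: every position from 'A' through 'Z'.       */
  /* Under ASCII that is just the 26 letters; under EBCDIC the  */
  /* range also contains '}' and '\' and other non-letters.     */
  ranged = collate(rank('A'), rank('Z'));

  /* VERIFY returns 0 when every character of the first         */
  /* argument is found in the second.                           */
  value = 'VanDyke';
  bad = verify(upcase(value), upper);
  put bad=;
run;
```

Here BAD should be 0 on either kind of host, since the check uses
the literal string rather than RANGED.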

/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
\   Howard Schreier, U.S. Dept. of Commerce, Washington    /
|          (Using Version 5 under IBM OS MVS/XA)           |
/   BITNET: HIS@NIHCU          INTERNET: HIS@CU.NIH.GOV    \
\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/