[net.misc] Character sets, sorting etc.

blarson@oberon.UUCP (Bob Larson) (10/28/85)

[Let's demonstrate the need by cross-posting to something other than
net.internat]

Some people seem to be under the mistaken impression that ASCII hasn't
changed.  Lower case letters were added in (rather than the shift-in /
shift-out kludge), _ was changed from left arrow to underline, ^ was changed
from up arrow to caret, etc.  I don't think adding an eighth bit would
change it enough to consider it something other than ASCII.

Sorting order in ASCII really isn't correct either.  Do you like all of your
upper case words coming before your lower case ones?  The sorting order
problem is really one of replacing a case translator with a table lookup.
Hopefully the table could be made easy to change for working in different
languages.
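
A minimal sketch of the table-lookup idea in Python, offered as an
illustration only.  The names make_table and table_key are hypothetical,
and the 256-entry table simply assumes an 8-bit character set:

```python
# Hypothetical sketch: sort by a per-character weight table instead of raw codes.

def make_table(overrides=None):
    """Return a 256-entry list; entry i is the sort weight of character code i."""
    table = list(range(256))
    # Default table: each lower-case ASCII letter gets its upper-case twin's weight.
    for code in range(ord("a"), ord("z") + 1):
        table[code] = code - 32
    # A language-specific variant is just a few more table entries.
    for ch, weight in (overrides or {}).items():
        table[ord(ch)] = weight
    return table

def table_key(word, table):
    """Map each character to its weight; codes above 255 keep their own value."""
    return [table[ord(c)] if ord(c) < 256 else ord(c) for c in word]

words = ["banana", "Cherry", "apple", "Zebra"]
ascii_order = sorted(words)
table_order = sorted(words, key=lambda w: table_key(w, make_table()))
print(ascii_order)  # ['Cherry', 'Zebra', 'apple', 'banana']
print(table_order)  # ['apple', 'banana', 'Cherry', 'Zebra']
```

Changing language then means passing different overrides, for example
make_table({"ä": ord("A")}) to file a-umlaut with the plain a's, while
the sorting code itself stays the same.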

-- 
Bob Larson
Arpa: Blarson@Usc-Ecl.Arpa
Uucp: {the (mostly unknown) world}!ihnp4!sdcrdcf!oberon!blarson
                 {several select chunks}!sdcrdcf!oberon!blarson

guido@boring.UUCP (11/01/85)

In article <150@oberon.UUCP> blarson@oberon.UUCP (Bob Larson) writes:
>The sorting order
>problem is really one of replacing a case translator with a table lookup.
>Hopefully the table could be made easy to change for working in different
>languages.

YES!  Decent sorting should always be done by table lookup.  As an
example, the Macintosh international utilities package sorts strings
in this way, and the table can be customized to cope with national
variations in the desired dictionary order.  The Mac still uses the
character set's native ordering to determine an ordering for strings
that compare equal using the table (e.g., AA equals aa but precedes
it, while aa precedes AB), so the character set's ordering still
matters.
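
The two-level comparison described above can be sketched in Python as
follows.  This is not the Macintosh international utilities interface;
mac_style_key is a hypothetical name, and str.upper() merely stands in
for the customizable table:

```python
def mac_style_key(s):
    # First component: the table-mapped string (upper-casing is a stand-in
    # for the real, customizable table).
    # Second component: the original string, so that strings the table
    # considers equal fall back to the character set's native ordering.
    return (s.upper(), s)

names = ["AB", "aa", "AA"]
ordered = sorted(names, key=mac_style_key)
print(ordered)  # ['AA', 'aa', 'AB']
```

AA and aa compare equal on the first component, so the native ordering
puts AA first, and both still precede AB.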

I don't know whether the Macintosh character set (which is a superset
of ASCII and contains most accented or otherwise slightly modified
characters found in various Western European languages, but does not
support different alphabets) would be acceptable as a standard,
but at least it addresses the problems that are encountered most
frequently, it fits in 8 bits and is compatible with ASCII.

(I'm afraid that there is another standard extension of ASCII which
uses up the 8th bit for lots of control codes like cursor up.
However this does not seem to have caught on very much.)

	Guido van Rossum, CWI, Amsterdam (guido@mcvax.UUCP)

franka@mmintl.UUCP (Frank Adams) (11/04/85)

In article <6672@boring.UUCP> guido@mcvax.UUCP (Guido van Rossum) writes:
>I don't know whether the Macintosh character set (which is a superset
>of ASCII and contains most accented or otherwise slightly modified
>characters found in various Western European languages, but does not
>support different alphabets) would be acceptable as a standard,
>but at least it addresses the problems that are encountered most
>frequently, it fits in 8 bits and is compatible with ASCII.
>
>(I'm afraid that there is another standard extension of ASCII which
>uses up the 8th bit for lots of control codes like cursor up.
>However this does not seem to have caught on very much.)

There is another standard extension of ASCII which is used for the IBM
PC.  It has a fair number of modified characters; I don't know how it
compares with the Macintosh set.  (It does not have the eastern European
c's, s's, or z's with curlicues; it does have the vaguely similar French
c.)  It also has a fair selection of special characters.  I am not
actually recommending it, just putting it up for consideration.  Given
the source, I think it has to be taken into account.

Frank Adams                           ihpn4!philabs!pwa-b!mmintl!franka
Multimate International    52 Oakland Ave North    E. Hartford, CT 06108

mikeb@inset.UUCP (Mike Banahan) (11/04/85)

In article <150@oberon.UUCP> blarson@oberon.UUCP (Bob Larson) writes:
>Sorting order in ASCII really isn't correct either.  Do you like all of your
>upper case words coming before your lower case ones?  The sorting order
>problem is really one of replacing a case translator with a table lookup.
>Hopefully the table could be made easy to change for working in different
>languages.

How right you are, Bob! There's lots to it as well. The sorting problem is going
to be a famous one - UNIX hackers have sort of got used (sorry about the pun)
to making do with ASCII sorting order, but it's completely unacceptable in a
number of environments. The current proposals for ISO 8859 mean that only
English has even poor sorting order based on character encoding - for the
other languages that it is meant to support, such as French, Scandinavian
and so on, it's a non-starter. A whole bunch of accented and further
alphabetic characters are found in the ``top'' 128 character positions,
with absolutely no correlation to their expected sorting position.
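
To make the problem concrete, here is a small Python sketch that sorts
on the encoded byte values.  It assumes Python's "latin-1" codec as a
stand-in for the 8-bit layout under discussion; the word list is made up
for illustration:

```python
# Sort words purely by their 8-bit code values (Latin-1 assumed as the encoding).
words = ["zebra", "école", "eclair", "Zoo"]
by_code = sorted(words, key=lambda w: w.encode("latin-1"))
print(by_code)  # ['Zoo', 'eclair', 'zebra', 'école']
```

The accented word lands after everything starting with z, because its
first character sits in the top 128 positions.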

Some languages confound this by not being very sure about just what
their collating sequence is: see the item posted by Jaap Akkerhuis which
points out that in Dutch, depending on which of 3 more or less official
alphabets you choose, there may or may not be a ``y''. If there is,
it sorts the same as the character PAIR ``ij''. So the algorithms can't
even work on a character-by-character basis. Also, my spies tell me that
in French, when two words are compared, accents are ignored unless the
words are the same without them, in which case rules are used to separate
the two.
Fun stuff, isn't it?
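
A rough Python sketch of both effects, under stated assumptions.  The
function names are hypothetical.  The French tie-break here is simply
the raw string, an assumed placeholder for the real rules, which are not
spelled out above.  The Dutch key naively treats every ``ij'' as ``y'',
which a real implementation would need to refine:

```python
import unicodedata

def strip_accents(word):
    """Drop combining marks after canonical decomposition, e.g. 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def french_key(word):
    # Primary: the word with accents ignored.
    # Tie-break: the raw word (ASSUMED placeholder for the actual French rules).
    return (strip_accents(word).lower(), word)

def dutch_key(word):
    # Treat the character pair "ij" as if it were the single letter "y".
    return word.lower().replace("ij", "y")

french = sorted(["cotes", "côte", "cote", "coté"], key=french_key)
dutch = sorted(["ijs", "yacht", "ijzer", "zee", "water"], key=dutch_key)
print(french)  # ['cote', 'coté', 'côte', 'cotes']
print(dutch)   # ['water', 'yacht', 'ijs', 'ijzer', 'zee']
```

Note that neither key can be built from a one-character-to-one-weight
table alone: the Dutch key works on a pair, and the French key needs a
second pass to break ties.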

It's going to take some fancy table-driven stuff to make sense of all this!

As for ranges in Regular Expressions ..... I would love to hear how to
make sense of them.
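
This does not answer the question, but a short Python sketch shows why
it arises.  A range such as [a-z] is defined by character codes, so an
accented letter falls outside it even though a reader would call it a
lower-case letter (the helper name is hypothetical):

```python
import re

# [a-z] means "codes from 'a' through 'z'", not "any lower-case letter".
ascii_word = re.compile(r"[a-z]+")

def is_ascii_lower_word(s):
    """True if every character of s lies in the code range a..z."""
    return ascii_word.fullmatch(s) is not None

print(is_ascii_lower_word("ete"))  # True
print(is_ascii_lower_word("été"))  # False
```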
-- 
Mike Banahan, Technical Director, The Instruction Set Ltd.
mcvax!ukc!inset!mikeb

jack@boring.UUCP (11/05/85)

In article <6672@boring.UUCP> guido@mcvax.UUCP (Guido van Rossum) writes:

>(I'm afraid that there is another standard extension of ASCII which
>uses up the 8th bit for lots of control codes like cursor up.
>However this does not seem to have caught on very much.)
>
>	Guido van Rossum, CWI, Amsterdam (guido@mcvax.UUCP)

As far as I remember, this 8 bit ASCII (which isn't called ASCII, by the
way, but ISO-something-or-other) uses codes 0200-0240 for extra control
functions, and 0241-0277 for extra characters.
I even think that if you take a letter in normal ASCII, and add bit 8, you
still have a letter (be it a different one, of course:-).
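
The "add bit 8 and you still have a letter" recollection can be checked
with a few lines of Python, ASSUMING the code in question matches what
Python calls "latin-1" (ISO 8859-1).  That is an assumption; the
standard is not identified above.  high_bit_twin is a hypothetical
helper name:

```python
import string

def high_bit_twin(ch):
    """Return the character whose code is ch's code with the eighth bit (0x80) set."""
    return bytes([ord(ch) | 0x80]).decode("latin-1")

# ASCII letters whose high-bit counterpart is NOT a letter in Latin-1.
not_letters = [c for c in string.ascii_letters if not high_bit_twin(c).isalpha()]
print(high_bit_twin("A"))  # Á
print(not_letters)         # ['w', 'W']
```

Under that assumption the claim holds for every ASCII letter except w
and W, whose counterparts are the division and multiplication signs.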

Since this code seems to have been more-or-less accepted (I know of at least
two terminals that accept it, or part of it), I guess the MAC will probably
use the same code.

If there is interest, I'll type in the code-table (more-or-less, of course).
-- 
	Jack Jansen, jack@mcvax.UUCP
	The shell is my oyster.

herbie@polaris.UUCP (Herb Chong) (11/07/85)

In article <6681@boring.UUCP> jack@boring.UUCP (Jack Jansen) writes:
>>(I'm afraid that there is another standard extension of ASCII which
>>uses up the 8th bit for lots of control codes like cursor up.
>>However this does not seem to have caught on very much.)
>>
>>	Guido van Rossum, CWI, Amsterdam (guido@mcvax.UUCP)
>
>
>As far as I remember, this 8 bit ASCII (which isn't called ASCII, by the
>way, but ISO-something-or-other) uses codes 0200-0240 for extra control
>functions, and 0241-0277 for extra characters.
>I even think that if you take a letter in normal ASCII, and add bit 8, you
>still have a letter (be it a different one, of course:-).

Is this the same 8-bit ASCII code, called US8-ASCII, that was pushed
by IBM as the 8-bit standard character code when they announced the
360's (a long time ago)?

Herb Chong...

I'm still user-friendly -- I don't byte, I nybble....

New net address --

VNET,BITNET,NETNORTH,EARN: HERBIE AT YKTVMH
UUCP:  {allegra|cbosgd|cmcl2|decvax|ihnp4|seismo}!philabs!polaris!herbie
CSNET: herbie.yktvmh@ibm-sj.csnet
ARPA:  herbie.yktvmh.ibm-sj.csnet@csnet-relay.arpa
========================================================================
DISCLAIMER:  what you just read was produced by pouring lukewarm
tea for 42 seconds onto 9 people chained to 6 Ouija boards.