[comp.std.internat] International Collating Sequence

crowl@cs.rochester.edu (Lawrence Crowl) (09/28/87)

Several posters in this group have pointed out the difficulty in satisfying
the many national collating sequences within an international character code.
There is a further problem in that if I wish to collate words from several
languages (say a list of authors), then I must pick a collating method that
probably does not include all characters.  In short, I may be forced to use
some local, non-standard collating sequence to handle all entries.  How does
your bibliographic database handle foreign authors?  Does it drop accents that
are not in your native alphabet?

I submit that we need not only an international character code, but an
international collating sequence as well.  Such a sequence should be very
simple.  There should be no "double letter" rules or unnatural separation
of accented letters from base letters.  I see no reason not to embed the
collating sequence within the numeric codes for the characters.

For example, a character set meeting these criteria might have the following
ordering:

   A a `A `a "A "a .A .a  ...  AE ae B b C c ,C ,c D d E e 'E 'e `E `e ...

No international standard based on USASCII can meet this alphabet and still
embed the collating sequence within the character codes.
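
As a rough illustration of what embedding the collating sequence in the
character codes would mean, the Python sketch below numbers a small
repertoire in the order of the example above.  The repertoire, the use of
precomposed letters such as "À", and the names ORDER, CODE and encode are
assumptions made for illustration; they are not part of any proposal or
standard.

```python
# Hypothetical repertoire listed in the desired collating order,
# loosely following the example ordering above (not a real standard).
ORDER = ["A", "a", "À", "à", "Ä", "ä", "Å", "å", "Æ", "æ",
         "B", "b", "C", "c", "Ç", "ç", "D", "d",
         "E", "e", "É", "é", "È", "è"]

# The code of each character is simply its position in the sequence.
CODE = {ch: i for i, ch in enumerate(ORDER)}

def encode(word):
    """Translate a word into its list of numeric codes."""
    return [CODE[ch] for ch in word]

words = ["bad", "Äcca", "abe", "Ace", "àce", "add"]
# No collation table is consulted at sort time: plain numeric
# comparison of the codes yields the order built into the code set.
print(sorted(words, key=encode))
```

Because the comparison is strictly character by character, a case or
accent difference in an early position outweighs everything after it
("add" sorts before "àce" here).  Whether that is acceptable is exactly
the kind of question such a sequence would have to settle.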

Note that many letter forms in Latin, Greek, and Cyrillic are the same.  It
is possible to merge these three alphabets into a single alphabet.  This will
involve some re-ordering of the letters from at least two of the original
alphabets, but not a great deal.  I do not know whether this is a good idea or
not, I just thought I would mention it.  Of course, we still have Arabic,
Hebrew, Kanji, Kana, etc. to incorporate.

Perhaps a better approach is to start from scratch with a new character
standard, one designed from the start to accommodate international needs.
I am willing to translate my files to a new character set.  Are you?

-- 
  Lawrence Crowl		716-275-9499	University of Rochester
		      crowl@cs.rochester.edu	Computer Science Department
...!{allegra,decvax,rutgers}!rochester!crowl	Rochester, New York,  14627

sandi@apollo.uucp (Sandra Martin) (09/29/87)

Lawrence Crowl @ U of Rochester, CS Dept, Rochester, NY writes:
>I submit that we need not only an international character code, but an
>international collating sequence as well.  Such a sequence should be very
>simple.  There should be no "double letter" rules or unnatural separation
>of accented letters from base letters.  I see no reason not to embed the
>collating sequence within the numeric codes for the characters.
>
>For example, a character set meeting these criteria might have the following
>ordering:
>
>   A a `A `a "A "a .A .a  ...  AE ae B b C c ,C ,c D d E e 'E 'e `E `e ...

I agree that an international collating sequence would be nice, but you
can't make arbitrary rules against double letters and separating characters
with diacriticals. In Spanish, 'ch' sorts between 'c' and 'd' in the
alphabet (likewise, 'll' comes between 'l' and 'm'). How would your
sequence handle this situation? You cannot ignore it just because it's
inconvenient.
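
The "double letter" rule can be sketched as follows.  This is a minimal
Python illustration that assumes a cut-down traditional Spanish alphabet
in which "ch" and "ll" are single collating elements; the names SPANISH,
RANK and spanish_key are hypothetical.

```python
# Hypothetical traditional Spanish alphabet in which "ch" and "ll"
# count as single letters (only the letters needed here are listed).
SPANISH = ["a", "b", "c", "ch", "d", "e", "h", "i", "l", "ll",
           "m", "n", "o", "r", "u", "z"]
RANK = {letter: i for i, letter in enumerate(SPANISH)}

def spanish_key(word):
    """Split a word into collating elements, preferring digraphs."""
    word = word.lower()
    key, i = [], 0
    while i < len(word):
        if word[i:i + 2] in ("ch", "ll"):
            key.append(RANK[word[i:i + 2]])
            i += 2
        else:
            key.append(RANK[word[i]])
            i += 1
    return key

words = ["chico", "cuna", "dedo", "luz", "llama", "mano", "cero"]
print(sorted(words))                   # plain code-point order
print(sorted(words, key=spanish_key))  # digraph-aware order
```

Under the digraph-aware key "chico" follows "cuna" and "llama" follows
"luz", which no assignment of codes to single characters can reproduce.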

In the Swedish alphabet, a(ring), a", and o" appear AFTER 'z'. They DO
NOT sort with the unaccented a's and o's. In Danish, the 'ae' ligature
also appears near the end of the alphabet. Why should an international
collating sequence fail to recognize these realities? A few months back,
Erland Sommarskog of ENEA Data in Stockholm posted an article to this
newsgroup in which he noted (perhaps partly in jest) that if the Swedes
had invented computers, English-speakers would have had to accept the
fact that 'v' and 'w' are equivalent. As an English speaker, I'm sure
you wouldn't want to accept such a restriction. Why should people from
other countries have to accept an unnatural order for their characters?

The fact is that there is no way to construct ONE international collating
sequence. In German, the a" sorts with the other a's. In Swedish, it
sorts at the end of the alphabet. So whatever solution is invented, it
must be flexible enough to handle these realities.

   Sandra Martin, Apollo Computer
   UUCP:  ...{mit-erl,mit-eddie,yale,uw-beaver,decvax}!apollo!sandi
   ARPA:  apollo!sandi@eddie.mit.edu

crowl@cs.rochester.edu (Lawrence Crowl) (09/29/87)

In article <379119b2.b88e@apollo.uucp> sandi@apollo.uucp (Sandra Martin) writes:
)Lawrence Crowl @ U of Rochester, CS Dept, Rochester, NY writes:
)>I submit that we need not only an international character code, but an
)>international collating sequence as well.  Such a sequence should be very
)>simple.  There should be no "double letter" rules or unnatural separation
)>of accented letters from base letters.
)
)I agree that an international collating sequence would be nice, but you can't
)make arbitrary rules against double letters and separating characters with
)diacriticals.  In Spanish, 'ch' sorts between 'c' and 'd' in the alphabet
)(likewise, 'll' comes between 'l' and 'm').  How would your sequence handle
)this situation?  You cannot ignore it just because it's inconvenient.
)[Followed by lots of examples of incompatibility between national collating
)sequences.]

You've missed my point.  No international character code will support the
various national collating sequences.  If we have an international collating
sequence, ignoring national sequences, then we can have a very simple coding
scheme which naturally supports a simple collating sequence.  An international
sequence tells me what to do when collating foreign words, etc.  This leaves a
programmer with two choices, sorting based on the international sequence or
sorting based on his or her national sequence.  Any international character
code will make the latter difficult, but the former can be easy with a good
character code and collating sequence pair.

I am not suggesting forcing people to abandon national sequences, just giving
them an international alternative that is easy and efficient.

oster@dewey.soe.berkeley.edu (David Phillip Oster) (09/30/87)

Since different nations have different, incompatible collating
sequences and any _international_ collating system could not
sort the same set into two different orders at once, an
_international_ collating sequence must be different from the national
collating sequence.  Since we aren't shackled by the existing
national collating sequences, we might as well make our new,
international one simple.

Hey, why not just sort by the numeric value of the ASCII code of the
characters? That way, all our existing English language software
already does the "right" thing.

crowl@cs.rochester.edu (Lawrence Crowl) (09/30/87)

In article <21031@ucbvax.BERKELEY.EDU> oster@dewey.soe.berkeley.edu.UUCP
(David Phillip Oster) writes:
>Hey, why not just sort by the numeric value of the ASCII code of the
>characters?  That way, all our existing English language software already does
>the "right" thing.

It doesn't do the "right" thing.  No one I know says "Z" < "a".  And what do we
do about all the modified characters?  Tack them on at the end in some random
order?  Ugly!  Surely we can do something halfway rational.
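
The complaint about "Z" < "a" is easy to reproduce.  The Python sketch
below compares plain code-point order with a case-insensitive key; the
name dictionary_key is hypothetical, and the tie-break it uses is only
one of several reasonable choices.

```python
words = ["apple", "Zebra", "banana", "Apple"]

# Code-point order: every upper-case letter precedes every lower-case one.
print(sorted(words))

def dictionary_key(word):
    # Compare case-insensitively first; fall back to the raw string
    # only to make the order of "Apple"/"apple" deterministic.
    return (word.lower(), word)

print(sorted(words, key=dictionary_key))
```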

jeg@hector.UUCP (Judy Grass) (09/30/87)

>scheme which naturally supports a simple collating sequence.  An international
>sequence tells me what to do when collating foreign words, etc.  This leaves a
>programmer with two choices, sorting based on the international sequence or
>sorting based on his or her national sequence.  Any international character
>code will make the latter difficult, but the former can be easy with a good
>character code and collating sequence pair.
>
>I am not suggesting forcing people to abandon national sequences, just giving
>them an international alternative that is easy and efficient.

Even given an international sequence, you will still have a problem.  Your
sequence is based on the roman alphabet.  There are a LOT of languages
that do not use that alphabet.  A standardized transcription for
each language will have to be chosen.  I know of at least five methods
of transcribing Russian that are considered standard for some purpose.
Japanese has several different transcriptions.  Chinese too.  I don't
know how to come up with one transcription system that will cover that
kind of range of languages.  My first impulse would be to use some
variant of the IPA (International Phonetic Alphabet), but transcriptions
are spelling-to-spelling translations.  Phonetic approaches aren't
particularly relevant.
				-- J. Grass     ATT Bell Labs,  Murray Hill NJ
						ulysses!jeg

srg@quick.COM (Spencer Garrett) (10/01/87)

In article <2706@sol.ARPA>, crowl@cs.rochester.edu (Lawrence Crowl) writes:
> I submit that we need not only an international character code, but an
> international collating sequence as well.  Such a sequence should be very
> simple.  There should be no "double letter" rules or unnatural separation
> of accented letters from base letters.  I see no reason not to embed the
> collating sequence within the numeric codes for the characters.

Absolutely.

> Note that many letter forms in Latin, Greek, and Cyrillic are the same.  It
> is possible to merge these three alphabets into a single alphabet.  This will
> involve some re-ordering of the letters from at least two of the original
> alphabets, but not a great deal.  I do not know whether this is a good idea or
> not, I just thought I would mention it.  Of course, we still have Arabic,
> Hebrew, Kanji, Kana, etc. to incorporate.

Technically very difficult and probably politically impossible.

> Perhaps a better approach is to start from scratch with a new character
> standard, one designed from the start to accommodate international needs.
> I am willing to translate my files to a new character set.  Are you?

I think this has seeds of a good idea, and I would be willing to shift
to a new character set to accomplish it.  I'd like to suggest that it's
important for the alphabetic portion of the code to fit within 8 bits,
though, or the storage cost associated with shifting to the new code
will be prohibitive.  This wouldn't have to include katakana or hiragana
and couldn't possibly include kanji.  The JIS presently uses two 7-bit
codes per symbol and reaches them through a "shift-out" sequence from
a more-or-less standard ASCII.  There are way too many kanji to fit into
8 bits, and the notion of "collating sequence" doesn't really apply to
them.  (Actually, a clever encoding might make this a new "feature".)
Katakana and hiragana couldn't coexist with anything else in 8 bits
and they're presently encoded in 14 (really 16) bits, so retaining a 2-byte
encoding wouldn't cause any pain.  If we used an "escape to k-h" followed
by a byte to encode the character itself, then these characters would at
least collate together when mixed with this new international alphabet,
and would collate correctly with each other, all without changing the
semantics of strcmp().  (perhaps there should be a separate escape to
each, but you get the idea.)  Perhaps the escape to kanji would be followed
by two 8-bit bytes?  If the escape codes, at least, were standardized then
terminals which weren't set up to handle kanji could at least know how to
skip them and perhaps display an "unknown symbol" code in their place.
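
A minimal Python sketch of the escape scheme described above.  The escape
values 0xFE and 0xFF, the payload bytes, and the function name display
are invented for illustration; nothing here corresponds to JIS or to any
real encoding.

```python
# Hypothetical escape bytes, chosen above every single-byte letter code.
ESC_KANA = 0xFE    # followed by one byte selecting a kana
ESC_KANJI = 0xFF   # followed by two bytes selecting a kanji

def display(data, knows_kana=False):
    """Render a byte string on a terminal that may lack kana/kanji fonts."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b == ESC_KANA:
            out.append("<kana %d>" % data[i + 1] if knows_kana else "?")
            i += 2
        elif b == ESC_KANJI:
            out.append("?")   # skip the two-byte payload
            i += 3
        else:
            out.append(chr(b))
            i += 1
    return "".join(out)

latin = b"abc"
kana1 = bytes([ESC_KANA, 1, ESC_KANA, 5])
kana2 = bytes([ESC_KANA, 2])
kanji = bytes([ESC_KANJI, 0x30, 0x21])

# Python compares bytes objects bytewise as unsigned values, which is
# also how ISO C defines the result of strcmp().
print(sorted([kana2, kanji, latin, kana1]) == [latin, kana1, kana2, kanji])
print(display(b"x" + kana1 + kanji + b"y"))
```

Because each escape byte is larger than any letter code, all kana strings
land together after the alphabetic ones, and a terminal that knows only
the escape lengths can still step over what it cannot draw.  In C the
payload bytes would also have to avoid the value zero.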

The final (:->) problem is how to mix l->r and r->l "horizontal" writing
with eastern "vertical" writing.  Mixing the first two is tricky, but
already being done.  I have no idea how to add "vertical" to the list.

Hmmm.  It just occurred to me that rewriting all the western languages in
a new alphabet and then trying to retain the existing Japanese script is
a bit inconsistent.  It's not too hard to phoneticize Japanese (they've
done it 3 times already, once using the roman alphabet) so maybe they
should just join us in using this mythical new alphabet.  I don't know
if this is possible for Chinese and its relatives, however.  I suspect
it is not.

guy@gorodish.UUCP (10/01/87)

> I submit that we need not only an international character code, but an
> international collating sequence as well.  Such a sequence should be very
> simple.

Well, Esperanto is probably simpler than most natural languages (or, at least,
simpler than most European languages), but it's certainly not taken over the
world....  An international collating sequence could certainly be cooked up,
but in practice who (other than programmers) would want it?  I'd believe such a
collating sequence could replace existing national collating sequences if the
bulk of the people affected by it said it was OK.
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com

walters@io.UUCP (Tim Walters) (10/01/87)

In article <2752@sol.ARPA> crowl@cs.rochester.edu (Lawrence Crowl) writes:
>I am not suggesting forcing people to abandon national sequences, just giving
>them an international alternative that is easy and efficient.

I'm afraid I can't see the advantage of a sorting sequence that's easy
and efficient but doesn't sort letters the way you want. I would never
use a routine which put, say, 'w' after 'z', even if it was efficient
and followed accepted practice in Europe; yet this is what your
proposed collating sequence would look like to someone in Sweden.
National sequences aren't just a matter of local taste; they are THE
way dictionaries, phone books, book indexes, and everything else are
sorted in a particular country.

Since there isn't really any acceptable common collating sequence, I
would much rather see an efficient standardized routine which can
collate according to any national standard. I would much rather use
such a routine in my code, knowing that it could be configured to
produce acceptable output for any country.

-- 
	...!harvard!umb!ileaf!walters	Tim Walters, Interleaf
	  ...!sun!sunne!ileaf!walters	Ten Canal Park, Cambridge, MA 02141
					(617) 577-9813 x5510

aeb@mcvax.UUCP (10/02/87)

In article <29640@sun.uucp> guy%gorodish@Sun.COM (Guy Harris) writes:
>An international collating sequence could certainly be cooked up,
>but in practice who (other than programmers) would want it?

In a bibliographic journal one is forced to list the authors
or names of journals in some order, mixing names from many different
languages and using many different types of diacritical marks.
Thus, one has to define some "international collating sequence"
in such a situation. Does G\*:odel come before or after Godsil?

-- 
      Andries Brouwer -- CWI, Amsterdam -- uunet!mcvax!aeb -- aeb@cwi.nl

guy%gorodish@Sun.COM (Guy Harris) (10/04/87)

> In a bibliographic journal one is forced to list the authors
> or names of journals in some order, mixing names from many different
> languages and using many different types of diacritical marks.
> Thus, one has to define some "international collating sequence"
> in such a situation. Does G\*:odel come before or after Godsil?

Yes, but would you want your phone books sorted using this sequence?  Unless
you can eliminate *all* uses of national collating sequences when used by
computers, an international collating sequence would only be able to
*supplement*, not *replace*, national collating sequences.  As such, you'd
still have to have code to handle the national collating sequences; the
international collating sequence would be yet another variant, along with all
the national collating sequences.  The bulk of the collating sequence problem
would be unaffected by this international collating sequence.

Also, if the primary intent of this sequence is to support bibliographies
with authors and titles in multiple languages, it's not clear that overloading
e.g. the glyph "H" with the meanings "aitch" in the Roman alphabet, "eta" in
the Greek alphabet, and "en" in the Cyrillic alphabet would be necessary; would
not such databases be, at least in countries using the Roman alphabet,
Romanized?  I don't know whether bibliographies in Greek or in languages using
the Cyrillic alphabet Hellenize or Cyrilify (?) foreign names.  Given that,
would you need a single international character set and accompanying collating
sequence?

(For that matter, would the same bibliographical journal be sorted the same way
when prepared in several different languages, or would the native collating
sequence be used?  Would the same bibliographical journal even *look* the same
when prepared in different languages?  "Moskva" is turned into "Moscow" in
English, but is it turned into "Moscow" in other languages as well?)

walters@io.UUCP (Tim Walters) (10/05/87)

In article <1297@haddock.ISC.COM> karl@haddock.ima.isc.com (Karl Heuer) writes:
>In article <393@io.UUCP> walters@wally.UUCP (Tim Walters) writes:
>>I'm afraid I can't see the advantage of a sorting sequence that's easy
>>and efficient but doesn't sort letters the way you want.  I would never
>>use a routine which put, say, 'w' after 'z'...
>
>Funny, I use one all the time (ASCII strcmp) that sorts 'a' after 'Z', even
>though that's not the standard way to sort things in my native language
>(American English).

Well, I use strcmp quite a bit myself, mostly because it's there and
easy to use. It does put 'a' after 'Z', but this usually isn't too
much of a problem since most of my text is all caps, initial caps, or
all lower case. That just means (at most) three ranges of text to look
at, with everything ordered nicely within those ranges. Even so, I
think in most cases I would prefer to call, say, 'nstrcmp' which
sorted things according to a (user configurable) standard
dictionary ordering.
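
One way such an 'nstrcmp' might be configured is sketched below in
Python.  The factory make_nstrcmp and the alphabet string passed to it
are hypothetical; the only point is that the ordering is data supplied by
the user rather than a property of the character codes.

```python
from functools import cmp_to_key

def make_nstrcmp(alphabet):
    """Build a strcmp-like function from a user-supplied alphabet string."""
    rank = {ch: i for i, ch in enumerate(alphabet)}
    unknown = len(alphabet)  # characters outside the alphabet sort last

    def nstrcmp(a, b):
        ka = [rank.get(ch, unknown) for ch in a]
        kb = [rank.get(ch, unknown) for ch in b]
        return (ka > kb) - (ka < kb)  # negative, zero, or positive

    return nstrcmp

# One possible configuration: each letter's cases sit side by side.
nstrcmp = make_nstrcmp("AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz")
print(nstrcmp("Zebra", "apple"))
print(sorted(["apple", "Zebra", "Banana"], key=cmp_to_key(nstrcmp)))
```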

>>National sequences ... are THE way dictionaries, phone books, book indexes,
>>and everything else are sorted in a particular country.
>
>Even within a country, it's not completely consistent.  German \(o" collates
>as `o' in the dictionary, but `oe' in the phone book.  I believe it has been
>stated in this newsgroup that Dutch \(ij can sort as `ij', `y', or a letter
>between `x' and `y'.

You're right, it was a little too broad to say that there was only one
standard per country. I had forgotten about the different sorting
standards in Germany. I hadn't heard about the alternate sortings of
the Dutch ij.  There are probably other countries which have more than
one way of sorting in certain contexts. I would argue, however, that
this does not mean that people can easily, or happily, adapt to a new
sorting standard; rather, I think it means that end users would like
to select the sorting sequence themselves.

There is a similar diversity in the national preferences for formats
of dates. A single standard output format for dates might be
acceptable in a few cases, but most people will prefer to see them
written the way they're used to seeing them.

gnu@hoptoad.uucp (John Gilmore) (10/05/87)

aeb@cwi.nl (Andries Brouwer) wrote:
> Thus, one has to define some "international collating sequence"
> in such a situation. Does G\*:odel come before or after Godsil?

One possible solution to this problem is to define multiple code values
with the same graphic image, but different sorts.  In other words, if
there are seventeen languages that use a' and it sorts in four different
positions, give it four codes, and depend on the typist to enter the
right code.  (Actually, you're depending on the keyboard translation table
most of the time, which should be right for your country.)

This would even make names from different languages sort properly; e.g.
in the right place for their native language.  Speakers of other languages
would get confused about where to look, though.  It also implies an
exhaustive research effort and puts constraints on new languages.
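
The multiple-codes idea can be sketched as follows.  The code values, the
table GLYPH and the sample words are invented for illustration; the point
is only that two codes may share a printed form while occupying different
positions in the sequence.

```python
# Hypothetical code table: code value -> glyph. Codes are assigned in
# collating order, and "ä" appears twice with different positions.
GLYPH = {
    10: "a",
    11: "ä",   # "German" a-umlaut, sorts next to "a"
    20: "b",
    40: "r",
    50: "z",
    60: "ä",   # "Swedish" a-umlaut, sorts after "z"
}

def render(codes):
    """Both ä codes produce the same printed form."""
    return "".join(GLYPH[c] for c in codes)

german_word = [20, 11, 40]    # as entered on a German keyboard
swedish_word = [20, 60, 40]   # as entered on a Swedish keyboard
bar, bz = [20, 10, 40], [20, 50]

entries = [swedish_word, bz, german_word, bar]
print([render(e) for e in sorted(entries)])
```

The printed list contains the same visible word twice, in two different
places, which is the confusion for readers of other languages noted
above.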

Personally I wouldn't mind changing over to a new international
alphabet where American "w" sorted after "z".  Of course, the change
would be gradual; international publications such as newspapers would
do it first, and it would eventually spread to the rest of the society
as everything became more international, and as people got used to it.
Like the changeover of Romanized Chinese systems a few years ago
(Peking->Beijing).  The most important aspect about such a change, for
me, is that I'd only want to do it once.

PS:  Many publications already have indices containing characters with
no well-defined sorting order, e.g. symbols and numbers.  Take any
computer science textbook as example; where does "/*EOF" sort?  Where
do you find "3Com" in the phone book?  (Mountain View, I know :-)
-- 
{dasys1,ncoast,well,sun,ihnp4}!hoptoad!gnu			  gnu@toad.com

kimcm@ambush.UUCP (Kim Chr. Madsen) (10/06/87)

In article <363@zuring.cwi.nl> aeb@cwi.nl (Andries Brouwer) writes:
>Thus, one has to define some "international collating sequence"
>in such a situation. Does G\*:odel come before or after Godsil?

Or more than that: assume that you have to sort the names of authors by
surname, and the name "A. J. Dijon" appears in the list.  How should the
"ij" be interpreted: as Dutch "y", as Spanish "ij", or as two separate
letters?  Well, that depends on the origin of Mr. Dijon, so you will
probably have to put some semantics into the list of names to make it
work, like "A. J. Dijon (<national-code>)", and you will probably have
to use specialized tools to sort such a list.  So an international
collating sequence may be a good thing, but we need more than that to
make it work.
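
A small Python sketch of sorting a list whose entries carry a national
code.  The codes "nl", "fr" and "en", the function tagged_key, and the
single Dutch rule shown are assumptions for illustration, not a complete
treatment of any language.

```python
# Each entry carries a (hypothetical) national code alongside the surname.
entries = [("Dijon", "nl"), ("Dyer", "en"), ("Dijon", "fr"), ("Dixon", "en")]

def tagged_key(entry):
    surname, nation = entry
    key = surname.lower()
    if nation == "nl":
        # One of the Dutch conventions mentioned in this thread:
        # treat "ij" as if it were "y".
        key = key.replace("ij", "y")
    return (key, nation)

print(sorted(entries, key=tagged_key))
```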

					Kim Chr. Madsen.

agc@ist.UUCP (Alistair G. Crooks) (10/09/87)

In article <363@zuring.cwi.nl>, aeb@cwi.nl (Andries Brouwer) writes:
> In article <29640@sun.uucp> guy%gorodish@Sun.COM (Guy Harris) writes:
> >An international collating sequence could certainly be cooked up,
> >but in practice who (other than programmers) would want it?
> 
> [...bibliographic journal example deleted...]
> Thus, one has to define some "international collating sequence"
> in such a situation. Does G\*:odel come before or after Godsil?

All these thoughts about each separate language's character-sorting
properties, together with all the talk in comp.arch on shared libraries
(thankfully dying out now), led me to thinking:

   Why not strip the string routines from libc into a shared library,
   that can be dynamically linked at run time? Or even just strcmp()
   and strncmp()?

This would mean manufacturers would have to write these routines once for
each language. Yes, I know that not everyone has shared libraries (yet),
but in a few months or years time?

Comments, ideas, anyone?

Alistair G. Crooks (agc@ist.co.uk or ...!mcvax!ist!agc)

karl@haddock.ISC.COM (Karl Heuer) (10/13/87)

In article <1483@ist.UUCP> agc@ist.UUCP (Alistair G. Crooks) writes:
>Why not strip the string routines from libc into a shared library,
>that can be dynamically linked at run time? Or even just strcmp()
>and strncmp()?

Actually, a shared library normally gives you the choice at startup-time
rather than run-time.

I don't think it's appropriate to replace strcmp().  Most uses of strcmp() are
to test two strings for exact equality; these should be left alone.  The ANSI
C library includes a new function (it's "strcoll()" in the Oct86 dpANS; but I
think it may have changed since then) which will "digest" any string in a
locale-specific way.  (For example, in German-Telephone-Directory mode it
could map "Schr\(o"der" to "schroeder".)  This seems like a good approach.
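
The "digest" step can be sketched in Python as a pair of key functions,
one per sorting convention.  The function names and the sample surnames
are hypothetical, and only the three umlauted vowels are handled.  (In
the C standard as eventually published, this transforming role is played
by strxfrm(), while strcoll() compares two strings directly.)

```python
def digest_dictionary(s):
    """German dictionary style: an umlaut sorts as its base vowel."""
    s = s.lower()
    for umlaut, base in (("ä", "a"), ("ö", "o"), ("ü", "u")):
        s = s.replace(umlaut, base)
    return s

def digest_phone_book(s):
    """German telephone-directory style: an umlaut sorts as vowel + e."""
    s = s.lower()
    for umlaut, pair in (("ä", "ae"), ("ö", "oe"), ("ü", "ue")):
        s = s.replace(umlaut, pair)
    return s

names = ["Schrodt", "Schröder"]
print(digest_phone_book("Schröder"))
print(sorted(names, key=digest_dictionary))
print(sorted(names, key=digest_phone_book))
```

The digested strings are compared with the ordinary byte-wise comparison,
so the exact-equality uses of strcmp() are left untouched.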

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

guy%gorodish@Sun.COM (Guy Harris) (10/13/87)

>    Why not strip the string routines from libc into a shared library,
>    that can be dynamically linked at run time? Or even just strcmp()
>    and strncmp()?

Because:

	1) Some systems will permit you to link a program completely
	statically, so they won't be affected by changes to the shared library.

	2) This makes it harder to add support for new collating sequences, as
	the person adding this support has to build an entirely new shared
	library.

	3) This also makes it harder for a program to *change* the current
	language in midstream; one could imagine a program (e.g., a
	multilingual word processor) wanting to do so.

Also, changing "strcmp" would cause problems, because some program might want
to support *both* a particular natural language's sorting order and the
"native" byte-string sorting order.

The X3J11 committee developing ANSI C has already come up with schemes to
support sorting orders other than the native byte-string order; these schemes
permit programs to change the sorting order on-the-fly, and to use "strcmp"
directly if this is called for.  Typical implementations will, one hopes, load
the sorting order information from a file, based on the current locale.  This
file will be tailorable by people without the source code, so that, for
example, a vendor could distribute the system worldwide and have people in each
country tailor it for their environment (since developers at the home office
may not know that country's environment as well as the people in that country).
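
As a sketch of the table-driven approach, the Python fragment below keeps
each sorting order as plain data that can be swapped while the program
runs.  The table format, the locale names "sv" and "de", and the
functions load_rank and sort_for are invented for this example; they do
not describe the X3J11 design or any real locale file.

```python
# Stand-ins for per-locale table files; the format (collating elements
# in order, separated by whitespace) is invented for this sketch.
TABLES = {
    "sv": "a b c d e f g h i j k l m n o p q r s t u v w x y z å ä ö",
    "de": "a ä b c d e f g h i j k l m n o ö p q r s t u ü v w x y z",
}

def load_rank(locale_name):
    """Parse a table into {element: position}; a real system would
    read it from a file selected by the current locale."""
    return {el: i for i, el in enumerate(TABLES[locale_name].split())}

def sort_for(locale_name, words):
    rank = load_rank(locale_name)
    return sorted(words, key=lambda w: [rank[ch] for ch in w.lower()])

words = ["zebra", "äpple", "apa"]
print(sort_for("sv", words))   # a-umlaut after z
print(sort_for("de", words))   # a-umlaut next to a
```

Because the tables are data rather than code, someone without the source
could edit them, and a program could switch between them in midstream.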

bert@aiva.ed.ac.uk (Bert Hutchings) (10/14/87)

Here's a Scottish contribution to this topic.  The British Post Office takes
an apparently cavalier, but entirely practical, approach to sorting Scottish
surnames beginning with 'Mac' in their telephone directories - every variant
spelling of this prefix is sorted as Mac, and the case of the next letter is
ignored,  but the name is printed as the subscriber prefers to use it.  Thus

	MacDonald
	McEnroe
	M`Farquhar
	Macgillicuddy
	Machine Tool Hire Ltd.
	Mcilwraith
	M`indoe

are in collated sequence. I think that special cases like this, and like the
common requirement  to elide 'a', 'the' etc,  support the negative view that
an underlying almost-ready-to-use character collation sequence is so small a
component of every desired end product that it isn't really worth the effort.
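
The directory rule described above can be sketched as a sort key in
Python.  The function name directory_key is hypothetical, and the three
prefix spellings handled are just the ones shown in the listing.

```python
def directory_key(name):
    """Sort key: any spelling of the prefix counts as "mac" and the
    case of the following letters is ignored."""
    lowered = name.lower()
    for prefix in ("mac", "mc", "m`"):
        if lowered.startswith(prefix):
            return "mac" + lowered[len(prefix):]
    return lowered

listing = ["MacDonald", "McEnroe", "M`Farquhar", "Macgillicuddy",
           "Machine Tool Hire Ltd.", "Mcilwraith", "M`indoe"]

# The listing is already in directory order under this key ...
print(sorted(listing, key=directory_key) == listing)
# ... but not under plain code-point comparison.
print(sorted(listing) == listing)
```

The name is stored and printed as the subscriber spells it; only the key
used for comparison is normalized.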

bob@its63b.ed.ac.uk (ERCF08 Bob Gray) (10/16/87)

In article <176@aiva.ed.ac.uk> bert@aiva.ed.ac.uk (Bert Hutchings) writes:
>surnames beginning with 'Mac' in their telephone directories - every variant
>spelling of this prefix is sorted as Mac, and the case of the next letter is
>ignored,  but the name is printed as the subscriber prefers to use it.  Thus
>
>	MacDonald
>	McEnroe
>	M`Farquhar
>	Macgillicuddy
>	Machine Tool Hire Ltd.
>	Mcilwraith
>	M`indoe

Just to confuse things further, some women whose family name
begins with the Mac (meaning son of) prefix, are insisting on
being known by the Nic (meaning daughter of) prefix.

Yet more special cases to be taken care of.

> ... support the negative view that
>an underlying almost-ready-to-use character collation sequence is so small a
>component of every desired end product that it isn't really worth the effort.

Any collation sequence would also have to be easily
re-defined at a user level to indicate local changes in
sequence, or changes with time. Any product could have an
"international" sequence but the options should always be
there to easily override the default options.
	Bob.