[net.internat] Alphabetical Order

johnl@ima.UUCP (10/08/85)

Let's talk for a minute or two about putting strings in alphabetical order.
Here, as I understand it, are some of the problems involved:

  -- Character set.  The codes used for characters not found in the English 
alphabet are not well standardized.  Some people reassign the "national 
option" characters which in the U.S.  are things like curly braces.  Some, 
like the IBM PC crowd, try to define an 8-bit character set.  Some, like 
the Teletex crowd, define multi-byte sequences for characters with 
accents.  I have no idea what happens to characters like the Icelandic eth 
and thorn which are not created by adding an accent to an English letter.  

  -- Upper vs.  lower case.  The mapping between upper and lower case is 
quite language specific.  Some languages are quite strict about mapping 
between corresponding accented upper and lower case, while others (French, 
notably) are pretty casual about their upper case accented letters.  I 
gather that there are languages with lower case letters that have no upper 
case equivalent.  

  -- Digraphs.  Many languages have character pairs which, for the purpose
of alphabetization, are treated as one letter, such as Spanish "ll".

  -- Alphabet order.  Some languages sort accented letters in next to their 
unaccented versions.  Others put them at the end of the alphabet or 
otherwise scramble them around.  
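
The digraph and alphabet-order points can be sketched as a sort key. This is a minimal illustration, not a definitive implementation: it assumes the traditional Spanish alphabet in which "ch" and "ll" count as single letters and "ñ" follows "n", and it assumes accented vowels file with their unaccented forms. The names `ALPHABET`, `RANK` and `spanish_key` are made up for the example.

```python
# Assumed alphabet: traditional Spanish order, digraphs "ch" and "ll" as letters.
ALPHABET = "a b c ch d e f g h i j k l ll m n ñ o p q r s t u v w x y z".split()
RANK = {letter: i for i, letter in enumerate(ALPHABET)}
DIGRAPHS = {"ch", "ll"}
# Assumption: accented vowels sort together with the plain vowel.
ACCENTS = str.maketrans("áéíóúü", "aeiouu")


def spanish_key(word):
    """Turn a word into a list of letter ranks, reading digraphs as one letter."""
    s = word.lower().translate(ACCENTS)
    key = []
    i = 0
    while i < len(s):
        pair = s[i:i + 2]
        if pair in DIGRAPHS:
            key.append(RANK[pair])
            i += 2
        else:
            # Characters outside the alphabet sort after every known letter.
            key.append(RANK.get(s[i], len(ALPHABET)))
            i += 1
    return key


words = ["luz", "llama", "lago", "curva", "chico", "dama"]
print(sorted(words))                   # character-code order
print(sorted(words, key=spanish_key))  # curva, chico, dama, lago, luz, llama
```

With this key "llama" files after "luz" and "chico" after "curva", which plain character-code order gets the other way round.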

Anything else important I've left out?

John Levine, Javelin Software, Cambridge MA 617-494-1400
{ decvax!cca | think | ihnp4 | cbosgd }!ima!johnl, Levine@YALE.ARPA

The opinions above are solely those of a 12 year old hacker who has broken
into my account, and not those of my employer or any other organization.

colonel@sunybcs.UUCP (Col. G. L. Sicherman) (11/03/85)

["You saved my life, Captain Buffalo!  Have a CIGAR!"]

> Here, as I understand it, are some of the problems involved:
> 
>   -- Character set.  ...
>   -- Upper vs. lower case. ...
>   -- Digraphs.  Many languages have character pairs which, for the purpose
> of alphabetization, are treated as one letter, such as Spanish "ll".
>   -- Alphabet order. ...
> 
> Anything else important I've left out?

How about equivalence?  A language might interfile "x" with "j", for
instance.  The Dutch interfile "y" and the digraph "ij".

(Or do they?  Can anybody think of a Dutch word in which "ij" is
_not_ equivalent to "y"?)
-- 
Col. G. L. Sicherman
UU: ...{rocksvax|decvax}!sunybcs!colonel
CS: colonel@buffalo-cs
BI: csdsicher@sunyabva

dik@zuring.UUCP (11/04/85)

In article <2435@sunybcs.UUCP> colonel@sunybcs.UUCP (Col. G. L. Sicherman) writes:
>How about equivalence?  A language might interfile "x" with "j", for
>instance.  The Dutch interfile "y" and the digraph "ij".
>
Not entirely true; there are three sorting orders in use for "ij":
1. Dictionary order: sort amongst i.
2. Encyclopaedical order: sort as a different letter (also different from y).
3. Most general: sort as equivalent to y.

>(Or do they?  Can anybody think of a Dutch word in which "ij" is
>_not_ equivalent to "y"?)

Yes (although the words that come to my mind are not of Dutch origin):
	bijouterie (from French of course): i and j do not form a digraph
	here but are two distinct letters.

>-- 
>Col. G. L. Sicherman

Sorting is however not such a problem: just write appropriate filters
that prepend the objects to be sorted with a key etc.
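
A rough sketch of such a key-prepending filter, covering two of the three "ij" orders listed above (filing "ij" among i, and filing it as y). The encyclopaedical order is left out because where the separate letter belongs is not stated here. The exception table `NOT_DIGRAPH` and all function names are hypothetical, for illustration only.

```python
# Hypothetical exception table: words where "i" and "j" are two separate letters.
NOT_DIGRAPH = {"bijouterie"}


def key_ij_among_i(word):
    """Dictionary order: "ij" is simply the letters i and j."""
    return word.lower()


def key_ij_as_y(word):
    """Most general order: the digraph "ij" files as if it were "y"."""
    w = word.lower()
    return w if w in NOT_DIGRAPH else w.replace("ij", "y")


def sort_with_prepended_key(words, make_key):
    """Prepend a key to each item, sort the lines, then strip the key again."""
    decorated = [make_key(w) + "\t" + w for w in words]
    decorated.sort()
    return [line.split("\t", 1)[1] for line in decorated]


words = ["ijs", "yoghurt", "ik", "bijouterie", "byte", "bijl"]
print(sort_with_prepended_key(words, key_ij_among_i))
print(sort_with_prepended_key(words, key_ij_as_y))
```

The tab separator sorts below any letter, so a shorter key still files ahead of a longer key that starts with it.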
-- 
dik t. winter, cwi, amsterdam, nederland
UUCP: {seismo|decvax|philabs}!mcvax!dik

mikeb@inset.UUCP (Mike Banahan) (11/06/85)

In article <2435@sunybcs.UUCP> colonel@sunybcs.UUCP (Col. G. L. Sicherman) writes:
>How about equivalence?  A language might interfile "x" with "j", for
>instance.  The Dutch interfile "y" and the digraph "ij".
>
>(Or do they?  Can anybody think of a Dutch word in which "ij" is
>_not_ equivalent to "y"?)

I'm sure the Dutch will tell you.

Just to add that in Norwegian orthography "aa" is an alternative for
the a with a circle on top. They are identical for all purposes.

You might care to ponder languages where lower case letters have no
upper case equivalent and vice versa, into the bargain.
-- 
Mike Banahan, Technical Director, The Instruction Set Ltd.
mcvax!ukc!inset!mikeb

jbn@wdl1.UUCP (11/08/85)

       Is there a collating sequence for Kanji?

					John Nagle

kimcm@diku.UUCP (Kim Christian Madsen) (11/12/85)

In article <36@diku.UUCP> keld@diku.UUCP (Keld Jørn Simonsen) writes:
>Well it is not true that 'aa' always can be replaced with
>a-with-a-circle-on-top in Danish (or Norwegian) writing.
>You may have connected words like 'ekstraarbejde' = extra work,
>where the two a's cannot be replaced.
>The same is true for 'ae' - eg. in 'sagaen' = the saga 
>and for 'oe' eg. 'koen' = the cow.

Which leads to the conclusion that either we have to do it by table
lookup or not do it at all. There are always exceptions, and if we can't
live with some compromises we will never get anywhere!
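
The table-lookup approach might look roughly like this. It is only a sketch: the exception table holds just the three words quoted above, it works on lower-case input only, and a whole-word table is a crude stand-in for a real dictionary. The names `DIGRAPHS`, `EXCEPTIONS` and `restore_letters` are invented for the example.

```python
# Two-letter spellings and the national letters they usually stand for.
DIGRAPHS = {"aa": "å", "ae": "æ", "oe": "ø"}

# Exception table: words in which the pair is really two separate letters.
# Only the examples from the quoted article; a real table would be far longer.
EXCEPTIONS = {"ekstraarbejde", "sagaen", "koen"}


def restore_letters(word):
    """Replace aa/ae/oe by the national letter unless the word is an exception."""
    if word.lower() in EXCEPTIONS:
        return word
    for pair, letter in DIGRAPHS.items():
        word = word.replace(pair, letter)
    return word


for w in ["aarhus", "koebenhavn", "ekstraarbejde", "sagaen", "koen"]:
    print(w, "->", restore_letters(w))
```

A word that contains both a true digraph and a false one would still come out wrong, which is the kind of compromise in question.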

Even if one does the job on a national basis there will be trouble,
because no language is frozen. For example, the Danish letter
a-with-circle-on-top was invented in this century and made official
in 1948 - the same can happen again!

Furthermore the improved communication between different parts of the world
leads to more and more 'foreign words' being accepted in each national
language; new words evolve and are incorporated into the language, and
older and rarely used words disappear.

Computer Scientists have always thought that sorting words was a piece of
cake, but people whose work is to make dictionaries might use a computer
sorted wordlist as a first draft and then do the rest of the work by hand.

But if we continue to use the ordinary ASCII sorting method, and it is
recognized in more and more applications, we might end up making
ASCII sorting the standard sorting method, but I wonder if this makes
anybody happy -- save the programmers )-;

But who cares, some people like the old English and refuse to read
Shakespeare's works unless they are in the 'original' old English version,
others are blasting americans (you know the people from over the sea,
NO, not australians YANKEE's...(-;) because they don't speak a 'proper'
english (and not even proper american!!! )-;) Old traditions must fall
and new rules be established -- that's the way of progress.

				Kim Chr. Madsen

spw2562@ritcv.UUCP (11/13/85)

In article <40@diku.UUCP> kimcm@diku.UUCP (Kim Christian Madsen) writes:
>But if we continue to use the ordinary ASCII sorting method, and it is
>recognized in more and more applications, we might end up making
>ASCII sorting the standard sorting method, but I wonder if this makes
>anybody happy -- save the programmers )-;

That'd make me real happy.. 8-)

>others are blasting americans ... because they don't speak a 'proper'
>english (and not even proper american!!! )-;)
>				Kim Chr. Madsen

Hey, every language has its dialects... ;-)

BTW, all americans (USians?) aren't yankees, just us northerners.

==============================================================================
	Steve Wall @ Rochester Institute of Technology
	USnail: 6675 Crosby Rd, Lockport, NY 14094, USA
	Usenet:	...!ritcv!spw2562			Unix 4.2 BSD
	BITNET:	SPW2562@RITVAXC				VAX/VMS 4.2
	Voice:  Yell "Hey Steve!"

andrew@stc.UUCP (11/14/85)

{}

I think we are in danger of confusing two different aspects of sorting.

1/ The simple case of ``sorting'' in whatever-my-machine-likes order
   for table lookup (automated binary searches, hash lookup etc)

2/ Sorting for human consumption.  This is almost certainly not character
   set order, and may not be even remotely related (Yes even in English).

Type 1 is largely irrelevant to internationalisation, except in as much
as this is the type of operation carried out by *all* our
general-purpose utilities, but there is little need to change these, as
we are doubtless more interested in the internal efficiency than the
external order (cf. dbm).

The other question (type 2) is a much more involved operation.  I
suggest we all reach for D.E.Knuth's book ``The Art of Computer
Programming'' volume 3 ``Sorting and Searching'' pp 7-9 exercise 16.
This spells out the problem much better than I could (and cuts down
the total news traffic).

It is fairly obvious that real sorting for humans will involve
sufficient heuristics to make the natural order of the internal
character set immaterial.  It seems likely that such sorting will have
to be done on a per-language (and to a certain extent per-country)
basis.
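
To make the contrast between the two types concrete, here is a small sketch in English only. The "human" rules used (ignore case, skip a leading article) are assumptions picked for illustration and are not taken from Knuth's exercise; `ARTICLES` and `human_key` are made-up names.

```python
# Assumed filing rule: a leading article is ignored when ordering titles.
ARTICLES = ("the ", "a ", "an ")


def human_key(title):
    """Key for human-facing order: case-insensitive, leading article dropped."""
    t = title.casefold()
    for article in ARTICLES:
        if t.startswith(article):
            t = t[len(article):]
            break
    return t


titles = ["The Art of Computer Programming", "algorithms", "Zen and the Art", "a Book"]

machine_order = sorted(titles)               # type 1: whatever the character set gives
human_order = sorted(titles, key=human_key)  # type 2: order meant for readers

print(machine_order)
print(human_order)
```

Type 1 puts every capitalised title ahead of "algorithms"; type 2 files "The Art of Computer Programming" under A. Each language would need its own version of `human_key`.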

This is not to say that multiple alphabets and their internal representation
are not relevant and interesting (and I can't contribute to that discussion),
merely that how such multi-lingual text sorts in simple per-character
comparisons is almost a red herring.
-- 
Regards,
	Andrew Macpherson.	<andrew@stc.UUCP>
	{aivru,creed,datlog,iclbra,iclkid,idec,inset,root44,stl,ukc}!stc!andrew

mikeb@inset.UUCP (Mike Banahan) (11/15/85)

In article <40@diku.UUCP> kimcm@diku.UUCP (Kim Christian Madsen) writes:
>......
>others are blasting americans (you know the people from over the sea,
>NO, not australians YANKEE's...(-;) because they don't speak a 'proper'
>english (and not even proper american!!! )-;) .....
>				Kim Chr. Madsen

Ho ho ho. Now, as I put on the flame resistant underwear, asbestos suit
cooled by liquid nitrogen, let us open a debate (which should soon
move to net.nlang).

PROPSITION: Americans, in the large, not only cannot speak English,
	they can't even *understand* it.

SUPPORTING EVIDENCE: Whenever I go there, I have to drop into Standard
English (restricted vocabulary of 1500 words, no idiomatic use), and use
Received Pronunciation (special attention paid to stress points, word endings;
standardised vowel sounds), if I want to be understood. I find that use
of normal spoken English results in incessant requests from U.S. native
citizens for me to slow down and repeat things; occasionally blank
gazes make me realise that I am just not being understood at all.

Interestingly, Australians have no trouble whatsoever with English English.
We, of course, have trouble understanding them :-)

Has anyone else noticed this phenomenon?

-- 
Mike Banahan, Technical Director, The Instruction Set Ltd.
mcvax!ukc!inset!mikeb

levy@ttrdc.UUCP (Daniel R. Levy) (11/19/85)

In article <797@inset.UUCP>, mikeb@inset.UUCP (Mike Banahan) writes:
>
>Ho ho ho. Now, as I put on the flame resistant underwear, asbestos suit
>cooled by liquid nitrogen, let us open a debate (which should soon
>move to net.nlang).
>
>PROPSITION: Americans, in the large, not only cannot speak English,
>	they can't even *understand* it.
>
>--
>Mike Banahan, Technical Director, The Instruction Set Ltd.
>mcvax!ukc!inset!mikeb

I don't believe a word of your "PROPSITION." :-)
-- 
 -------------------------------    Disclaimer:  The views contained herein are
|       dan levy | yvel nad      |  my own and are not at all those of my em-
|         an engihacker @        |  ployer or the administrator of any computer
| at&t computer systems division |  upon which I may hack.
|        skokie, illinois        |
 --------------------------------   Path: ..!ihnp4!ttrdc!levy

planting@uwvax.UUCP (W. Harry Plantinga) (11/21/85)

In article <797@inset.UUCP>, mikeb@inset.UUCP (Mike Banahan) writes:
>
>PROPSITION: Americans, in the large, not only cannot speak English,
>	they can't even *understand* it.
>
>Mike Banahan, Technical Director, The Instruction Set Ltd.
>mcvax!ukc!inset!mikeb
> 

What was that? Eh?

spw2562@ritcv.UUCP (11/21/85)

In article <797@inset.UUCP> mikeb@inset.UUCP (Mike Banahan) writes:
>In article <40@diku.UUCP> kimcm@diku.UUCP (Kim Christian Madsen) writes:
>>others are blasting americans (you know the people from over the sea,
>>NO, not australians YANKEE's...(-;) because they don't speak a 'proper'
>>english (and not even proper american!!! )-;) .....
>>				Kim Chr. Madsen
>Ho ho ho. Now, as I put on the flame resistant underwear, asbestos suit
>cooled by liquid nitrogen, let us open a debate

flame retardant- good idea..  VERY good idea..

>PROPSITION: Americans, in the large, not only cannot speak English,
>	they can't even *understand* it.

Hey, we speak english great.  It's british we have trouble with...

>SUPPORTING EVIDENCE: Whenever I go there, I have to drop into Standard
>English (restricted vocabulary of 1500 words, no idiomatic use), and use
>Received Pronunciation (special attention paid to stress points, word endings;
>standardised vowel sounds), if I want to be understood. I find that use
>of normal spoken English results in incessant requests from U.S. native
>citizens for me to slow down and repeat things; occasionally blank
>gazes make me realise that I am just not being understood at all.

If this is your evidence, maybe you should try learning english 8-)...

>Mike Banahan, Technical Director, The Instruction Set Ltd.
>mcvax!ukc!inset!mikeb

At least we know what diapers and bisquits are...   8-)

==============================================================================
        Steve Wall [Snoopy] @ Rochester Institute of Technology
        USnail: 6675 Crosby Rd, Lockport, NY 14094, USA
        Usenet: ...!ritcv!spw2562                       Unix 4.2 BSD
        BITNET: SPW2562@RITVAXC                         VAX/VMS 4.2
        Voice:  Yell "Hey Steve!"

    Disclaimer:  What I just said may or may not have anything to do with
                 what I meant to say...

andrew@stc.UUCP (11/25/85)

In article <9065@ritcv.UUCP> you write:
>In article <797@inset.UUCP> mikeb@inset.UUCP (Mike Banahan) writes:
>>PROPSITION: Americans, in the large, not only cannot speak English,
>>	they can't even *understand* it.
>
>Hey, we speak english great.  It's british we have trouble with...
>
>At least we know what diapers and bisquits are...   8-)

I assure you that the British are well aware of the plurals for a linen
or cotton napkin for infants, and unglazed white porcelain respectively;
or are you perhaps an architect, and using diaper in the technical sense?

:-) :-) :-) :-) :-) :-) :-) :-) :-) :-) :-) :-) :-) :-) :-) :-) :-) :-) :-)
-- 
Regards,
	Andrew Macpherson.	<andrew@stc.UUCP>
	{aivru,creed,datlog,iclbra,iclkid,idec,inset,root44,stl,ukc}!stc!andrew