johnl@ima.UUCP (10/08/85)
Let's talk for a minute or two about putting strings in alphabetical order.
Here, as I understand it, are some of the problems involved:

-- Character set.  The codes used for characters not found in the English
   alphabet are not well standardized.  Some people reassign the "national
   option" characters which in the U.S. are things like curly braces.  Some,
   like the IBM PC crowd, try to define an 8-bit character set.  Some, like
   the Teletex crowd, define multi-byte sequences for characters with
   accents.  I have no idea what happens to characters like the Icelandic
   eth and thorn which are not created by adding an accent to an English
   letter.

-- Upper vs. lower case.  The mapping between upper and lower case is quite
   language specific.  Some languages are quite strict about mapping between
   corresponding accented upper and lower case, while others (French,
   notably) are pretty casual about their upper case accented letters.  I
   gather that there are languages with lower case letters that have no
   upper case equivalent.

-- Digraphs.  Many languages have character pairs which, for the purpose
   of alphabetization, are treated as one letter, such as Spanish "ll".

-- Alphabet order.  Some languages sort accented letters in next to their
   unaccented versions.  Others put them at the end of the alphabet or
   otherwise scramble them around.

Anything else important I've left out?

John Levine, Javelin Software, Cambridge MA 617-494-1400
{ decvax!cca | think | ihnp4 | cbosgd }!ima!johnl, Levine@YALE.ARPA

The opinions above are solely those of a 12 year old hacker who has broken
into my account, and not those of my employer or any other organization.
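[Editorial sketch, not part of the original post.] The digraph and alphabet-order points above can be illustrated with a small sort key. This is only a rough sketch in Python: the alphabet table assumes a traditional Spanish order in which "ch" and "ll" count as single letters, and the names ALPHABET, RANK and collation_key are made up for the illustration. Accented vowels are not handled; any character missing from the table simply sorts last.

```python
# Hypothetical alphabet table: a traditional Spanish order in which
# "ch" and "ll" are single letters (an assumption for illustration).
ALPHABET = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k",
            "l", "ll", "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u",
            "v", "w", "x", "y", "z"]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}


def collation_key(word):
    """Turn a word into a list of letter ranks, matching digraphs first."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            # A digraph such as "ll" is consumed as one letter.
            key.append(RANK[pair])
            i += 2
        else:
            # Characters missing from the table sort after every known letter.
            key.append(RANK.get(word[i], len(ALPHABET)))
            i += 1
    return key


words = ["luz", "llama", "lobo", "chico", "cura", "dama"]
print(sorted(words))                     # plain character-code order
print(sorted(words, key=collation_key))  # digraph-aware order
```

With this table "llama" files after "luz" and "chico" after "cura", whereas a plain character-code sort puts both digraph words first within their initial letter.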
colonel@sunybcs.UUCP (Col. G. L. Sicherman) (11/03/85)
["You saved my life, Captain Buffalo!  Have a CIGAR!"]

> Here, as I understand it, are some of the problems involved:
>
> -- Character set. ...
> -- Upper vs. lower case. ...
> -- Digraphs. Many languages have character pairs which, for the purpose
>    of alphabetization, are treated as one letter, such as Spanish "ll".
> -- Alphabet order. ...
>
> Anything else important I've left out?

How about equivalence?  A language might interfile "x" with "j", for
instance.  The Dutch interfile "y" and the digraph "ij".

(Or do they?  Can anybody think of a Dutch word in which "ij" is _not_
equivalent to "y"?)
-- 
Col. G. L. Sicherman
UU: ...{rocksvax|decvax}!sunybcs!colonel
CS: colonel@buffalo-cs
BI: csdsicher@sunyabva
dik@zuring.UUCP (11/04/85)
In article <2435@sunybcs.UUCP> colonel@sunybcs.UUCP (Col. G. L. Sicherman) writes:
>How about equivalence?  A language might interfile "x" with "j", for
>instance.  The Dutch interfile "y" and the digraph "ij".
>

Not entirely true; there are three sorting orders in use for "ij":
1.  Dictionary order: sort amongst i.
2.  Encyclopaedical order: sort as a different letter (also different
    from y).
3.  Most general: sort as equivalent to y.

>(Or do they?  Can anybody think of a Dutch word in which "ij" is
>_not_ equivalent to "y"?)

Yes (although the words that come to my mind are not of dutch origin):
bijouterie (from french of course): i and j do not form a digraph here
but are two distinctive letters.

>--
>Col. G. L. Sicherman

Sorting is however not such a problem: just write appropriate filters
that prepend the objects to be sorted with a key etc.
-- 
dik t. winter, cwi, amsterdam, nederland
UUCP: {seismo|decvax|philabs}!mcvax!dik
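[Editorial sketch, not part of the original post.] The "prepend a key" filter suggested above might look roughly like the following in Python. Everything here is an assumption made for illustration: the names ij_key, sort_with_keys and NOT_DIGRAPH, the one-word exception list, and in particular where the separate "ij" letter goes in the encyclopaedia order (placed just after "y" here, which the post does not specify).

```python
# Hypothetical exception list: words where "i" and "j" are two letters.
# "bijouterie" is the example given in the post.
NOT_DIGRAPH = {"bijouterie"}


def ij_key(word, order):
    """Build a sort key for one of the three orders listed in the post."""
    w = word.lower()
    if order == "dictionary" or w in NOT_DIGRAPH:
        return w                      # "ij" sorts among the i's
    if order == "encyclopaedia":
        # Separate letter; placing it right after "y" is an assumption.
        return w.replace("ij", "y~")
    if order == "general":
        return w.replace("ij", "y")   # interfiled with "y"
    raise ValueError("unknown order: " + order)


def sort_with_keys(words, order):
    """Prepend a key to each word, sort in plain character order, strip the key."""
    decorated = [ij_key(w, order) + "\t" + w for w in words]
    decorated.sort()
    return [line.split("\t", 1)[1] for line in decorated]


words = ["ijs", "yuppie", "idee", "bijouterie", "bij"]
for order in ("dictionary", "encyclopaedia", "general"):
    print(order, sort_with_keys(words, order))
```

The sort itself stays a plain character-order sort; all the language knowledge lives in the key, which is the appeal of the filter approach. The catch is the exception list, which has to be maintained by hand.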
mikeb@inset.UUCP (Mike Banahan) (11/06/85)
In article <2435@sunybcs.UUCP> colonel@sunybcs.UUCP (Col. G. L. Sicherman) writes:
>How about equivalence?  A language might interfile "x" with "j", for
>instance.  The Dutch interfile "y" and the digraph "ij".
>
>(Or do they?  Can anybody think of a Dutch word in which "ij" is
>_not_ equivalent to "y"?)

I'm sure the Dutch will tell you.

Just to add that in Norwegian orthography "aa" is an alternative for
the a with a circle on top.  They are identical for all purposes.

You might care to ponder languages where lower case letters have no
upper case equivalent and vice versa, into the bargain.
-- 
Mike Banahan, Technical Director, The Instruction Set Ltd.
mcvax!ukc!inset!mikeb
jbn@wdl1.UUCP (11/08/85)
Is there a collating sequence for Kanji?

					John Nagle
kimcm@diku.UUCP (Kim Christian Madsen) (11/12/85)
In article <36@diku.UUCP> keld@diku.UUCP (Keld J|rn Simonsen) writes:
>Well it is not true that 'aa' always can be replaced with
>a-with-a-circle-on-top in Danish (or Norwegian) writing.
>You may have connected words like 'ekstraarbejde' = extra work,
>where the two a's cannot be replaced.
>The same is true for 'ae' - eg. in 'sagaen' = the saga
>and for 'oe' eg. 'koen' = the cow.

Which leads to the conclusion that either we have to do it by table
lookup or not do it at all.  There are always exceptions, and if we
can't live with some compromises we will not get anywhere!

Even if one does the job on a national basis there will be troubles,
because no language is frozen.  One example is the Danish letter
a-with-circle-on-top, which was invented in this century and made
official in 1948 - the same can happen again!  Furthermore the improved
communication between different parts of the world leads to more and
more 'foreign words' being accepted in each national language; new
words evolve and are incorporated into the language, and older and
rarely used words disappear.

Computer Scientists have always thought that sorting words was a piece
of cake, but people whose work is to make dictionaries might use a
computer sorted wordlist as a first draft and then do the rest of the
work by hand.

But if we continue to use the ordinary ASCII sorting method, and it is
recognized in more and more applications, we might end up with making
ASCII sorting the standard sorting method, but I wonder if this makes
anybody happy -- save the programmers )-;

But who cares, some people like the old english and refuse to read
Shakespeare's pieces unless it's in the 'original' old english version,
others are blasting americans (you know the people from over the sea,
NO, not australians YANKEE's...(-;) because they don't speak a 'proper'
english (and not even proper american!!! )-;)

Old traditions must fall and new rules be established -- that's the
way of progress.

					Kim Chr. Madsen
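[Editorial sketch, not part of the original post.] The "table lookup" option mentioned above could be sketched like this in Python. The exception table holds only the one compound quoted in the post, and the names AA_EXCEPTIONS, normalise_aa and danish_key are invented for the example. The key assumes the usual Danish alphabet order, in which the three extra letters follow "z". A real table would be much larger and, as the post points out, would never be finished.

```python
# Hypothetical exception table: compounds in which "aa" is two separate a's.
# "ekstraarbejde" is the example quoted in the post.
AA_EXCEPTIONS = {"ekstraarbejde"}

# Danish alphabet order: the three extra letters come after "z".
DANISH = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {letter: i for i, letter in enumerate(DANISH)}


def normalise_aa(word):
    """Rewrite "aa" as a-with-circle unless the word is a listed exception."""
    w = word.lower()
    if w in AA_EXCEPTIONS:
        return w
    return w.replace("aa", "å")


def danish_key(word):
    """Rank each letter by alphabet position; unknown characters sort last."""
    return [RANK.get(c, len(DANISH)) for c in normalise_aa(word)]


words = ["Aalborg", "zebra", "ekstraarbejde", "øl", "abe"]
print(sorted(words, key=danish_key))
```

Here "Aalborg" files at the very end with the a-with-circle words, while "ekstraarbejde" stays among the e's because the table says its two a's are separate letters.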
spw2562@ritcv.UUCP (11/13/85)
In article <40@diku.UUCP> kimcm@diku.UUCP (Kim Christian Madsen) writes:
>But if we continue to use the ordinary ASCII sorting method, and it is
>recognized in more and more applications we might end up with making
>ASCII sorting the standard sorting method, but I wonder if this makes
>anybody happy -- save the programmers )-;

That'd make me real happy.. 8-)

>others are blasting americans ... because they don't speak a 'proper'
>english (and not even proper american!!! )-;)
>				Kim Chr. Madsen

Hey, every language has its dialects... ;-)
BTW, all americans (USians?) aren't yankees, just us northerners.

==============================================================================
Steve Wall @ Rochester Institute of Technology
USnail: 6675 Crosby Rd, Lockport, NY 14094, USA
Usenet: ...!ritcv!spw2562	Unix 4.2 BSD
BITNET: SPW2562@RITVAXC		VAX/VMS 4.2
Voice:  Yell "Hey Steve!"
andrew@stc.UUCP (11/14/85)
{}

I think we are in danger of confusing two different aspects of sorting.

1/	The simple case of ``sorting'' in whatever-my-machine-likes order
	for table lookup (automated binary searches, hash lookup etc)

2/	Sorting for human consumption.  This is almost certainly not
	character set order, and may not be even remotely related (Yes
	even in English).

Type 1 is largely irrelevant to internationalisation, except in as much
as this is the type of operation carried out by *all* our general-purpose
utilities, but there is little need to change these, as we are doubtless
more interested in the internal efficiency than the external order
(cf. dbm).

The other question (type 2) is a much more involved operation.  I suggest
we all reach for D.E.Knuth's book ``The Art of Computer Programming''
volume 3 ``Sorting and Searching'' pp 7-9 exercise 16.  This spells out
the problem much better than I could (and cuts down the total news
traffic).

It is fairly obvious that real sorting for humans will involve sufficient
heuristics to make the natural order of the internal character set
immaterial.  It seems likely that such sorting will have to be done on a
per-language (and to a certain extent per-country) basis.

This is not to say that multiple alphabets and their internal
representation are not relevant and interesting (and I can't contribute
to that discussion), merely that how such multi-lingual text sorts in
simple per-character comparisons is almost a red herring.
-- 
Regards,
	Andrew Macpherson.	<andrew@stc.UUCP>
	{aivru,creed,datlog,iclbra,iclkid,idec,inset,root44,stl,ukc}!stc!andrew
mikeb@inset.UUCP (Mike Banahan) (11/15/85)
In article <40@diku.UUCP> kimcm@diku.UUCP (Kim Christian Madsen) writes:
>......
>others are blasting americans (you know the people from over the sea,
>NO, not australians YANKEE's...(-;) because they don't speak a 'proper'
>english (and not even proper american!!! )-;) .....
>				Kim Chr. Madsen

Ho ho ho. Now, as I put on the flame resistant underwear, asbestos suit
cooled by liquid nitrogen, let us open a debate (which should soon
move to net.nlang).

PROPSITION: Americans, in the large, not only cannot speak English,
	they can't even *understand* it.

SUPPORTING EVIDENCE: Whenever I go there, I have to drop into Standard
English (restricted vocabulary of 1500 words, no idiomatic use), and use
Received Pronunciation (special attention paid to stress points, word endings;
standardised vowel sounds), if I want to be understood.  I find that use
of normal spoken English results in incessant requests from U.S. native
citizens for me to slow down and repeat things; occasionally blank
gazes make me realise that I am just not being understood at all.

Interestingly, Australians have no trouble whatsoever with English English.
We, of course, have trouble understanding them :-)

Has anyone else noticed this phenomenon?
-- 
Mike Banahan, Technical Director, The Instruction Set Ltd.
mcvax!ukc!inset!mikeb
levy@ttrdc.UUCP (Daniel R. Levy) (11/19/85)
In article <797@inset.UUCP>, mikeb@inset.UUCP (Mike Banahan) writes:
>
>Ho ho ho. Now, as I put on the flame resistant underwear, asbestos suit
>cooled by liquid nitrogen, let us open a debate (which should soon
>move to net.nlang).
>
>PROPSITION: Americans, in the large, not only cannot speak English,
>	they can't even *understand* it.
>
>--
>Mike Banahan, Technical Director, The Instruction Set Ltd.
>mcvax!ukc!inset!mikeb

I don't believe a word of your "PROPSITION."  :-)
-- 
 -------------------------------    Disclaimer:  The views contained herein are
|       dan levy | yvel nad      |  my own and are not at all those of my em-
|         an engihacker @        |  ployer or the administrator of any computer
| at&t computer systems division |  upon which I may hack.
|        skokie, illinois        |
 --------------------------------   Path: ..!ihnp4!ttrdc!levy
planting@uwvax.UUCP (W. Harry Plantinga) (11/21/85)
In article <797@inset.UUCP>, mikeb@inset.UUCP (Mike Banahan) writes:
>
>PROPSITION: Americans, in the large, not only cannot speak English,
>	they can't even *understand* it.
>
>Mike Banahan, Technical Director, The Instruction Set Ltd.
>mcvax!ukc!inset!mikeb
>

What was that?  Eh?
spw2562@ritcv.UUCP (11/21/85)
In article <797@inset.UUCP> mikeb@inset.UUCP (Mike Banahan) writes:
>In article <40@diku.UUCP> kimcm@diku.UUCP (Kim Christian Madsen) writes:
>>others are blasting americans (you know the people from over the sea,
>>NO, not australians YANKEE's...(-;) because they don't speak a 'proper'
>>english (and not even proper american!!! )-;) .....
>>				Kim Chr. Madsen
>Ho ho ho. Now, as I put on the flame resistant underwear, asbestos suit
>cooled by liquid nitrogen, let us open a debate

flame retardant- good idea..  VERY good idea..

>PROPSITION: Americans, in the large, not only cannot speak English,
>	they can't even *understand* it.

Hey, we speak english great.  It's british we have trouble with...

>SUPPORTING EVIDENCE: Whenever I go there, I have to drop into Standard
>English (restricted vocabulary of 1500 words, no idiomatic use), and use
>Received Pronunciation (special attention paid to stress points, word endings;
>standardised vowel sounds), if I want to be understood.  I find that use
>of normal spoken English results in incessant requests from U.S. native
>citizens for me to slow down and repeat things; occasionally blank
>gazes make me realise that I am just not being understood at all.

If this is your evidence, maybe you should try learning english 8-)...

>Mike Banahan, Technical Director, The Instruction Set Ltd.
>mcvax!ukc!inset!mikeb

At least we know what diapers and bisquits are... 8-)

==============================================================================
Steve Wall [Snoopy] @ Rochester Institute of Technology
USnail: 6675 Crosby Rd, Lockport, NY 14094, USA
Usenet: ...!ritcv!spw2562	Unix 4.2 BSD
BITNET: SPW2562@RITVAXC		VAX/VMS 4.2
Voice:  Yell "Hey Steve!"
Disclaimer: What I just said may or may not have anything to do with
            what I meant to say...
andrew@stc.UUCP (11/25/85)
In article <9065@ritcv.UUCP> you write:
>In article <797@inset.UUCP> mikeb@inset.UUCP (Mike Banahan) writes:
>>PROPSITION: Americans, in the large, not only cannot speak English,
>>	they can't even *understand* it.
>
>Hey, we speak english great.  It's british we have trouble with...
>
>At least we know what diapers and bisquits are... 8-)

I assure you that the British are well aware of the plurals for a linen
or cotton napkin for infants, and unglazed white porcelain respectively;
or are you perhaps an architect, and using diaper in the technical sense?

:-) :-) :-) :-) :-) :-) :-) :-) :-) :-) :-) :-) :-) :-) :-) :-) :-) :-) :-)
-- 
Regards,
	Andrew Macpherson.	<andrew@stc.UUCP>
	{aivru,creed,datlog,iclbra,iclkid,idec,inset,root44,stl,ukc}!stc!andrew