crowl@cs.rochester.edu (Lawrence Crowl) (09/28/87)
Several posters in this group have pointed out the difficulty in satisfying the many national collating sequences within an international character code. There is a further problem: if I wish to collate words from several languages (say, a list of authors), then I must pick a collating method that probably does not include all characters. In short, I may be forced to use some local, non-standard collating sequence to handle all entries. How does your bibliographic database handle foreign authors? Does it drop accents that are not in your native alphabet?

I submit that we need not only an international character code, but an international collating sequence as well. Such a sequence should be very simple. There should be no "double letter" rules or unnatural separation of accented letters from base letters. I see no reason not to embed the collating sequence within the numeric codes for the characters.

For example, a character set meeting these criteria might have the following ordering:

    A a `A `a "A "a .A .a ... AE ae B b C c ,C ,c D d E e 'E 'e `E `e ...

No international standard based on USASCII can accommodate this alphabet and still embed the collating sequence within the character codes.

Note that many letter forms in Latin, Greek, and Cyrillic are the same. It is possible to merge these three alphabets into a single alphabet. This will involve some re-ordering of the letters from at least two of the original alphabets, but not a great deal. I do not know whether this is a good idea or not; I just thought I would mention it. Of course, we still have Arabic, Hebrew, Kanji, Kana, etc. to incorporate.

Perhaps a better approach is to start from scratch with a new character standard, one designed from the start to accommodate international needs. I am willing to translate my files to a new character set. Are you?
-- Lawrence Crowl 716-275-9499 University of Rochester crowl@cs.rochester.edu Computer Science Department ...!{allegra,decvax,rutgers}!rochester!crowl Rochester, New York, 14627
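A minimal Python sketch of the idea in the post above: if code values are assigned in collating order, sorting reduces to comparing code values. The ORDER table is a hypothetical fragment loosely based on the example ordering, with precomposed accented letters standing in for the ASCII digraph notation; it is an illustration, not a proposed standard.

```python
# Hypothetical fragment of a character set whose code values follow the
# collating order sketched in the post (accented letters next to their base).
ORDER = ["A", "a", "À", "à", "Ä", "ä", "Æ", "æ", "B", "b",
         "C", "c", "Ç", "ç", "D", "d", "E", "e", "É", "é", "È", "è"]

# Assign code values in collating order: the code IS the sort position.
CODE = {ch: i for i, ch in enumerate(ORDER)}


def encode(text):
    """Translate text into the hypothetical code values."""
    return [CODE[ch] for ch in text]


def collate(words):
    """Sort by plain code-value comparison; no collation tables are consulted."""
    return sorted(words, key=encode)


words = ["Eda", "ça", "Bad", "àb", "ab", "Cad"]
print(collate(words))
```

With this table the accented forms land beside their base letters, whereas sorting the same words by their usual code values would push them to the end.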
sandi@apollo.uucp (Sandra Martin) (09/29/87)
Lawrence Crowl @ U of Rochester, CS Dept, Rochester, NY writes:
>I submit that we need not only an international character code, but an
>international collating sequence as well. Such a sequence should be very
>simple. There should be no "double letter" rules or unnatural separation
>of accented letters from base letters. I see no reason not to embed the
>collating sequence within the numeric codes for the characters.
>
>For example, a character set meeting these criteria might have the following
>ordering:
>
> A a `A `a "A "a .A .a ... AE ae B b C c ,C ,c D d E e 'E 'e `E `e ...

I agree that an international collating sequence would be nice, but you can't make arbitrary rules against double letters and separating characters with diacriticals. In Spanish, 'ch' sorts between 'c' and 'd' in the alphabet (likewise, 'll' comes between 'l' and 'm'). How would your sequence handle this situation? You cannot ignore it just because it's inconvenient.

In the Swedish alphabet, a(ring), a", and o" appear AFTER 'z'. They DO NOT sort with the unaccented a's and o's. In Danish, the 'ae' ligature also appears near the end of the alphabet. Why should an international collating sequence fail to recognize these realities?

A few months back, Erland Sommarskog of ENEA Data in Stockholm posted an article to this newsgroup in which he noted (perhaps partly in jest) that if the Swedes had invented computers, English-speakers would have had to accept the fact that 'v' and 'w' are equivalent. As an English speaker, I'm sure you wouldn't want to accept such a restriction. Why should people from other countries have to accept an unnatural order for their characters?

The fact is that there is no way to construct ONE international collating sequence. In German, the a" sorts with the other a's. In Swedish, it sorts at the end of the alphabet. So whatever solution is invented, it must be flexible enough to handle these realities.
Sandra Martin, Apollo Computer UUCP: ...{mit-erl,mit-eddie,yale,uw-beaver,decvax}!apollo!sandi ARPA: apollo!sandi@eddie.mit.edu
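The rules cited in the post above (Spanish 'ch' and 'll' as letters of their own, Swedish letters after 'z') can be sketched as a sort key built from an ordered list of collating elements. This is a rough Python illustration; make_key and the alphabet lists are hypothetical names, and the alphabets contain only what the post mentions (for instance, no Spanish n-tilde).

```python
def make_key(alphabet):
    """Build a sort key from an ordered list of collating elements.

    Elements may be multi-character (e.g. Spanish "ch"); the longest
    match at each position wins.
    """
    rank = {elem: i for i, elem in enumerate(alphabet)}
    longest = max(len(e) for e in alphabet)

    def key(word):
        word = word.lower()
        out, i = [], 0
        while i < len(word):
            for n in range(longest, 0, -1):
                piece = word[i:i + n]
                if piece in rank:
                    out.append(rank[piece])
                    i += len(piece)
                    break
            else:
                # Unknown character: sort after everything in the alphabet.
                out.append(len(alphabet) + ord(word[i]))
                i += 1
        return out

    return key


LATIN = list("abcdefghijklmnopqrstuvwxyz")
# Order as described in the post: 'ch' after 'c', 'll' after 'l'.
SPANISH = list("abc") + ["ch"] + list("defghijkl") + ["ll"] + list("mnopqrstuvwxyz")
# Order as described in the post: a-ring, a-umlaut, o-umlaut after 'z'.
SWEDISH = LATIN + ["å", "ä", "ö"]

print(sorted(["llama", "cuna", "mano", "chico", "loro", "dama"], key=make_key(SPANISH)))
print(sorted(["ära", "zon", "åka", "ost", "ö"], key=make_key(SWEDISH)))
```

The same routine serves both languages; only the table changes, which is the kind of flexibility the post asks for.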
crowl@cs.rochester.edu (Lawrence Crowl) (09/29/87)
In article <379119b2.b88e@apollo.uucp> sandi@apollo.uucp (Sandra Martin) writes:
)Lawrence Crowl @ U of Rochester, CS Dept, Rochester, NY writes:
)>I submit that we need not only an international character code, but an
)>international collating sequence as well. Such a sequence should be very
)>simple. There should be no "double letter" rules or unnatural separation
)>of accented letters from base letters.
)
)I agree that an international collating sequence would be nice, but you can't
)make arbitrary rules against double letters and separating characters with
)diacriticals. In Spanish, 'ch' sorts between 'c' and 'd' in the alphabet
)(likewise, 'll' comes between 'l' and 'm'). How would your sequence handle
)this situation? You cannot ignore it just because it's inconvenient.
)[Followed by lots of examples of incompatibility between national collating
)sequences.]

You've missed my point. No international character code will support the various national collating sequences. If we have an international collating sequence, ignoring national sequences, then we can have a very simple coding scheme which naturally supports a simple collating sequence. An international sequence tells me what to do when collating foreign words, etc. This leaves a programmer with two choices: sorting based on the international sequence, or sorting based on his or her national sequence. Any international character code will make the latter difficult, but the former can be easy with a good character code and collating sequence pair.

I am not suggesting forcing people to abandon national sequences, just giving them an international alternative that is easy and efficient.
-- Lawrence Crowl 716-275-9499 University of Rochester crowl@cs.rochester.edu Computer Science Department ...!{allegra,decvax,rutgers}!rochester!crowl Rochester, New York, 14627
oster@dewey.soe.berkeley.edu (David Phillip Oster) (09/30/87)
Since different nations have different, incompatible collating sequences, and no _international_ collating system could sort the same set into two different orders at once, an _international_ collating sequence must be different from the national collating sequences. Since we aren't shackled by the existing national collating sequences, we might as well make our new, international one simple.

Hey, why not just sort by the numeric value of the ASCII code of the characters? That way, all our existing English language software already does the "right" thing.
crowl@cs.rochester.edu (Lawrence Crowl) (09/30/87)
In article <21031@ucbvax.BERKELEY.EDU> oster@dewey.soe.berkeley.edu.UUCP (David Phillip Oster) writes:
>Hey, why not just sort by the numeric value of the ASCII code of the
>characters? That way, all our existing English language software already does
>the "right" thing.

It doesn't do the "right" thing. No one I know says "Z" < "a". And what do we do about all the modified characters? Tack them on at the end in some random order? Ugly! Surely we can do something halfway rational.
-- Lawrence Crowl 716-275-9499 University of Rochester crowl@cs.rochester.edu Computer Science Department ...!{allegra,decvax,rutgers}!rochester!crowl Rochester, New York, 14627
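The "Z" < "a" complaint is easy to reproduce. The short Python sketch below contrasts plain code-value order with one possible dictionary-style order; the tie-breaking rule in by_letter is an assumption chosen for illustration, not something specified in the thread.

```python
words = ["apple", "Zebra", "banana", "Apple"]

# Code-value order: every upper-case letter precedes every lower-case one,
# because "Z" is 90 and "a" is 97 in ASCII.
by_code = sorted(words)

# Dictionary-style order: compare case-insensitively first, then use the
# original spelling only to break ties.
by_letter = sorted(words, key=lambda w: (w.lower(), w))

print(by_code)
print(by_letter)
```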
jeg@hector.UUCP (Judy Grass) (09/30/87)
>scheme which naturally supports a simple collating sequence. An international
>sequence tells me what to do when collating foreign words, etc. This leaves a
>programmer with two choices, sorting based on the international sequence or
>sorting based on his or her national sequence. Any international character
>code will make the latter difficult, but the former can be easy with a good
>character code and collating sequence pair.
>
>I am not suggesting forcing people to abandon national sequences, just giving
>them an international alternative that is easy and efficient.
>--
> Lawrence Crowl 716-275-9499 University of Rochester
> crowl@cs.rochester.edu Computer Science Department
>...!{allegra,decvax,rutgers}!rochester!crowl Rochester, New York, 14627

Even given an international sequence, you will still have a problem. Your sequence is based on the Roman alphabet. There are a LOT of languages that do not use that alphabet. A standardized transcription for each language will have to be chosen. I know of at least five methods of transcribing Russian that are considered standard for some purpose. Japanese has several different transcriptions. Chinese too.

I don't know how to come up with one transcription system that will cover that kind of range of languages. My first impulse would be to use some variant of the IPA (International Phonetic Alphabet), but transcriptions are spelling-to-spelling translations. Phonetic approaches aren't particularly relevant.
-- J. Grass ATT Bell Labs, Murray Hill NJ ulysses!jeg
srg@quick.COM (Spencer Garrett) (10/01/87)
In article <2706@sol.ARPA>, crowl@cs.rochester.edu (Lawrence Crowl) writes:
> I submit that we need not only an international character code, but an
> international collating sequence as well. Such a sequence should be very
> simple. There should be no "double letter" rules or unnatural separation
> of accented letters from base letters. I see no reason not to embed the
> collating sequence within the numeric codes for the characters.

Absolutely.

> Note that many letter forms in Latin, Greek, and Cyrillic are the same. It
> is possible to merge these three alphabets into a single alphabet. This will
> involve some re-ordering of the letters from at least two of the original
> alphabets, but not a great deal. I do not know whether this is a good idea or
> not, I just thought I would mention it. Of course, we still have Arabic,
> Hebrew, Kanji, Kana, etc. to incorporate.

Technically very difficult and probably politically impossible.

> Perhaps a better approach is to start from scratch with a new character
> standard. One designed from the start to accommodate international needs.
> I am willing to translate my files to a new character set. Are you?

I think this has seeds of a good idea, and I would be willing to shift to a new character set to accomplish it. I'd like to suggest that it's important for the alphabetic portion of the code to fit within 8 bits, though, or the storage cost associated with shifting to the new code will be prohibitive. This wouldn't have to include katakana or hiragana and couldn't possibly include kanji. The JIS presently uses two 7-bit codes per symbol and reaches them through a "shift-out" sequence from a more-or-less standard ASCII. There are way too many kanji to fit into 8 bits, and the notion of "collating sequence" doesn't really apply to them. (Actually, a clever encoding might make this a new "feature".)
Katakana and hiragana couldn't coexist with anything else in 8 bits and they're presently encoded in 14 (really 16) bits, so retaining a 2-byte encoding wouldn't cause any pain. If we used an "escape to k-h" followed by a byte to encode the character itself, then these characters would at least collate together when mixed with this new international alphabet, and would collate correctly with each other, all without changing the semantics of strcmp(). (Perhaps there should be a separate escape to each, but you get the idea.) Perhaps the escape to kanji would be followed by two 8-bit bytes? If the escape codes, at least, were standardized, then terminals which weren't set up to handle kanji could at least know how to skip them and perhaps display an "unknown symbol" code in their place.

The final (:->) problem is how to mix l->r and r->l "horizontal" writing with eastern "vertical" writing. Mixing the first two is tricky, but already being done. I have no idea how to add "vertical" to the list.

Hmmm. It just occurred to me that rewriting all the western languages in a new alphabet and then trying to retain the existing Japanese script is a bit inconsistent. It's not too hard to phoneticize Japanese (they've done it 3 times already, once using the Roman alphabet), so maybe they should just join us in using this mythical new alphabet. I don't know if this is possible for Chinese and its relatives, however. I suspect it is not.
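A rough Python sketch of the "escape to k-h" scheme described above, under loud assumptions: the escape byte value 0xF0, the six-entry kana table, and the use of ASCII as a stand-in for the new 8-bit alphabet are all hypothetical. Comparing Python bytes objects is lexicographic by byte value, which is how strcmp() would treat the encoded strings.

```python
ESC_KANA = 0xF0  # hypothetical escape byte, above every single-byte letter
KANA = ["あ", "い", "う", "え", "お", "か"]  # tiny illustrative table, a-i-u-e-o-ka
# Second byte is the position in the table; start at 1 so that no zero byte
# appears (strcmp() would stop at a NUL).
KANA_INDEX = {ch: i + 1 for i, ch in enumerate(KANA)}


def encode(text):
    """Encode letters as one byte each and kana as escape byte + index byte."""
    out = bytearray()
    for ch in text:
        if ch in KANA_INDEX:
            out += bytes([ESC_KANA, KANA_INDEX[ch]])
        else:
            out += ch.encode("ascii")  # stand-in for the 8-bit alphabet
    return bytes(out)


words = ["かお", "zebra", "あい", "apple", "いえ"]
print(sorted(words, key=encode))
```

Because every kana character begins with the same escape byte, the kana words group together, and the second byte orders them among themselves.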
guy@gorodish.UUCP (10/01/87)
> I submit that we need not only an international character code, but an
> international collating sequence as well. Such a sequence should be very
> simple.

Well, Esperanto is probably simpler than most natural languages (or, at least, simpler than most European languages), but it's certainly not taken over the world....

An international collating sequence could certainly be cooked up, but in practice who (other than programmers) would want it? I'd believe such a collating sequence could replace existing national collating sequences if the bulk of the people affected by it said it was OK.

Guy Harris {ihnp4, decvax, seismo, decwrl, ...}!sun!guy guy@sun.com
walters@io.UUCP (Tim Walters) (10/01/87)
In article <2752@sol.ARPA> crowl@cs.rochester.edu (Lawrence Crowl) writes:
>I am not suggesting forcing people to abandon national sequences, just giving
>them an international alternative that is easy and efficient.

I'm afraid I can't see the advantage of a sorting sequence that's easy and efficient but doesn't sort letters the way you want. I would never use a routine which put, say, 'w' after 'z', even if it was efficient and followed accepted practice in Europe; yet this is what your proposed collating sequence would look like to someone in Sweden. National sequences aren't just a matter of local taste; they are THE way dictionaries, phone books, book indexes, and everything else are sorted in a particular country.

Since there isn't really any acceptable common collating sequence, I would much rather see an efficient standardized routine which can collate according to any national standard. I would much rather use such a routine in my code, knowing that it could be configured to produce acceptable output for any country.
-- ...!harvard!umb!ileaf!walters Tim Walters, Interleaf ...!sun!sunne!ileaf!walters Ten Canal Park, Cambridge, MA 02141 (617) 577-9813 x5510
aeb@mcvax.UUCP (10/02/87)
In article <29640@sun.uucp> guy%gorodish@Sun.COM (Guy Harris) writes:
>An international collating sequence could certainly be cooked up,
>but in practice who (other than programmers) would want it?

In a bibliographic journal one is forced to list the authors or names of journals in some order, mixing names from many different languages and using many different types of diacritical marks. Thus, one has to define some "international collating sequence" in such a situation. Does G\*:odel come before or after Godsil?
-- Andries Brouwer -- CWI, Amsterdam -- uunet!mcvax!aeb -- aeb@cwi.nl
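The question above has more than one defensible answer, which a small Python sketch can make concrete. It compares two candidate rules: plain code-value order, and a "base letter first" order that ignores diacritics except as a tie-breaker. The function names are hypothetical, and neither rule is claimed to be the correct bibliographic convention.

```python
import unicodedata


def base_letters(word):
    """Strip combining marks after canonical decomposition (o-umlaut -> o)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def base_key(word):
    """Compare base letters first; the accented spelling only breaks ties."""
    return (base_letters(word).lower(), word)


names = ["Godsil", "Gödel"]
print(sorted(names))                # code-value order
print(sorted(names, key=base_key))  # base-letter order
```

Under code-value order Godsil comes first; under the base-letter rule Gödel does, because "godel" precedes "godsil".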
guy%gorodish@Sun.COM (Guy Harris) (10/04/87)
> In a bibliographic journal one is forced to list the authors
> or names of journals in some order, mixing names from many different
> languages and using many different types of diacritical marks.
> Thus, one has to define some "international collating sequence"
> in such a situation. Does G\*:odel come before or after Godsil?

Yes, but would you want your phone books sorted using this sequence? Unless you can eliminate *all* uses of national collating sequences when used by computers, an international collating sequence would only be able to *supplement*, not *replace*, national collating sequences. As such, you'd still have to have code to handle the national collating sequences; the international collating sequence would be yet another variant, along with all the national collating sequences. The bulk of the collating sequence problem would be unaffected by this international collating sequence.

Also, if the primary intent of this sequence is to support bibliographies with authors and titles in multiple languages, it's not clear that overloading e.g. the glyph "H" with the meanings "aitch" in the Roman alphabet, "eta" in the Greek alphabet, and "en" in the Cyrillic alphabet would be necessary; would not such databases be, at least in countries using the Roman alphabet, Romanized? I don't know whether bibliographies in Greek or in languages using the Cyrillic alphabet Hellenize or Cyrilify (?) foreign names. Given that, would you need a single international character set and accompanying collating sequence?

(For that matter, would the same bibliographical journal be sorted the same way when prepared in several different languages, or would the native collating sequence be used? Would the same bibliographical journal even *look* the same when prepared in different languages? "Moskva" is turned into "Moscow" in English, but is it turned into "Moscow" in other languages as well?)

Guy Harris {ihnp4, decvax, seismo, decwrl, ...}!sun!guy guy@sun.com
walters@io.UUCP (Tim Walters) (10/05/87)
In article <1297@haddock.ISC.COM> karl@haddock.ima.isc.com (Karl Heuer) writes:
>In article <393@io.UUCP> walters@wally.UUCP (Tim Walters) writes:
>>I'm afraid I can't see the advantage of a sorting sequence that's easy
>>and efficient but doesn't sort letters the way you want. I would never
>>use a routine which put, say, 'w' after 'z'...
>
>Funny, I use one all the time (ASCII strcmp) that sorts 'a' after 'Z', even
>though that's not the standard way to sort things in my native language
>(American English).

Well, I use strcmp quite a bit myself, mostly because it's there and easy to use. It does put 'a' after 'Z', but this usually isn't too much of a problem since most of my text is all caps, initial caps, or all lower case. That just means (at most) three ranges of text to look at, with everything ordered nicely within those ranges. Even so, I think in most cases I would prefer to call, say, 'nstrcmp', which sorted things according to a (user-configurable) standard dictionary ordering.

>>National sequences ... are THE way dictionaries, phone books, book indexes,
>>and everything else are sorted in a particular country.
>
>Even within a country, it's not completely consistent. German \(o" collates
>as `o' in the dictionary, but `oe' in the phone book. I believe it has been
>stated in this newsgroup that Dutch \(ij can sort as `ij', `y', or a letter
>between `x' and `y'.

You're right, it was a little too broad to say that there was only one standard per country. I had forgotten about the different sorting standards in Germany. I hadn't heard about the alternate sortings of the Dutch ij. There are probably other countries which have more than one way of sorting in certain contexts. I would argue, however, that this does not mean that people can easily, or happily, adapt to a new sorting standard; rather, I think it means that end users would like to select the sorting sequence themselves.
There is a similar diversity in the national preferences for formats of dates. A single standard output format for dates might be acceptable in a few cases, but most people will prefer to see them written the way they're used to seeing them.
-- ...!harvard!umb!ileaf!walters Tim Walters, Interleaf ...!sun!sunne!ileaf!walters Ten Canal Park, Cambridge, MA 02141 (617) 577-9813 x5510
gnu@hoptoad.uucp (John Gilmore) (10/05/87)
aeb@cwi.nl (Andries Brouwer) wrote:
> Thus, one has to define some "international collating sequence"
> in such a situation. Does G\*:odel come before or after Godsil?

One possible solution to this problem is to define multiple code values with the same graphic image, but different sorts. In other words, if there are seventeen languages that use a' and it sorts in four different positions, give it four codes, and depend on the typist to enter the right code. (Actually, you're depending on the keyboard translation table most of the time, which should be right for your country.)

This would even make names from different languages sort properly, i.e. in the right place for their native language. Speakers of other languages would get confused about where to look, though. It also implies an exhaustive research effort and puts constraints on new languages.

Personally I wouldn't mind changing over to a new international alphabet where American "w" sorted after "z". Of course, the change would be gradual; international publications such as newspapers would do it first, and it would eventually spread to the rest of the society as everything became more international, and as people got used to it, like the changeover of Romanized Chinese systems a few years ago (Peking->Beijing). The most important aspect of such a change, for me, is that I'd only want to do it once.

PS: Many publications already have indices containing characters with no well-defined sorting order, e.g. symbols and numbers. Take any computer science textbook as an example; where does "/*EOF" sort? Where do you find "3Com" in the phone book? (Mountain View, I know :-)
-- {dasys1,ncoast,well,sun,ihnp4}!hoptoad!gnu gnu@toad.com
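The "several codes, one glyph" idea above can be sketched in a few lines of Python. Everything here is hypothetical: the code table, the entry names, and the type_in helper that stands in for a national keyboard translation table. Only two positions for a-umlaut are modelled (next to 'a', and after 'z').

```python
import string

# Hypothetical code table in collating order. Each entry is (name, glyph);
# two entries share the glyph "ä" but occupy different sort positions.
ENTRIES = [("a", "a"), ("ä-near-a", "ä")]
ENTRIES += [(c, c) for c in string.ascii_lowercase[1:]]
ENTRIES += [("ä-after-z", "ä")]

CODE = {name: i for i, (name, _) in enumerate(ENTRIES)}
GLYPH = [glyph for _, glyph in ENTRIES]


def type_in(text, a_umlaut):
    """Simulate a keyboard table: the typist's country picks the code for ä."""
    return tuple(CODE[a_umlaut] if ch == "ä" else CODE[ch] for ch in text)


def display(codes):
    """Render code values back to glyphs; both ä codes look identical."""
    return "".join(GLYPH[c] for c in codes)


names = [type_in("bär", "ä-near-a"),   # as entered on one national keyboard
         type_in("bär", "ä-after-z"),  # as entered on another
         type_in("bus", "ä-after-z")]
print([display(n) for n in sorted(names)])
```

The output shows the same printed word in two different places in the list, which is the confusion for other readers that the post anticipates.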
kimcm@ambush.UUCP (Kim Chr. Madsen) (10/06/87)
In article <363@zuring.cwi.nl> aeb@cwi.nl (Andries Brouwer) writes:
>Thus, one has to define some "international collating sequence"
>in such a situation. Does G\*:odel come before or after Godsil?

Or more than that: assume that you have to sort the names of authors by surname, and the name "A. J. Dijon" appears in the list. How should the "ij" be interpreted: as Dutch "y", as Spanish "ij", or as two separate letters? That depends on the origin of Mr. Dijon, so you will probably have to put some semantic information into the list of names to make it work, like "A. J. Dijon (<national-code>)", and you will probably have to use specialized tools to sort such a list.

So an international collating sequence may be a good thing, but we need more than that to make it work.

Kim Chr. Madsen.
agc@ist.UUCP (Alistair G. Crooks) (10/09/87)
In article <363@zuring.cwi.nl>, aeb@cwi.nl (Andries Brouwer) writes:
> In article <29640@sun.uucp> guy%gorodish@Sun.COM (Guy Harris) writes:
> >An international collating sequence could certainly be cooked up,
> >but in practice who (other than programmers) would want it?
>
> [...bibliographic journal example deleted...]
> Thus, one has to define some "international collating sequence"
> in such a situation. Does G\*:odel come before or after Godsil?
> Andries Brouwer -- CWI, Amsterdam -- uunet!mcvax!aeb -- aeb@cwi.nl

All these thoughts about each separate language's character-sorting properties, and all the talk in comp.arch on shared libraries (thankfully dying out now), led me to thinking:

Why not strip the string routines from libc into a shared library that can be dynamically linked at run time? Or even just strcmp() and strncmp()?

This would mean manufacturers would have to write these routines once for each language.

Yes, I know that not everyone has shared libraries (yet), but in a few months' or years' time? Comments, ideas, anyone?

Alistair G. Crooks (agc@ist.co.uk or ...!mcvax!ist!agc)
karl@haddock.ISC.COM (Karl Heuer) (10/13/87)
In article <1483@ist.UUCP> agc@ist.UUCP (Alistair G. Crooks) writes:
>Why not strip the string routines from libc into a shared library,
>that can be dynamically linked at run time? Or even just strcmp()
>and strncmp()?

Actually, a shared library normally gives you the choice at startup-time rather than run-time.

I don't think it's appropriate to replace strcmp(). Most uses of strcmp() are to test two strings for exact equality; these should be left alone. The ANSI C library includes a new function (it's "strcoll()" in the Oct86 dpANS; but I think it may have changed since then) which will "digest" any string in a locale-specific way. (For example, in German-Telephone-Directory mode it could map "Schr\(o"der" to "schroeder".) This seems like a good approach.

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint
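The "digest" step described above can be sketched as a key function: transform each string once, then compare the transformed strings with an ordinary byte-wise comparison. (In the C standard as eventually published, this transforming role is played by strxfrm().) The Python below is an illustration only; phonebook_key is a hypothetical name, and the a-umlaut and u-umlaut expansions are assumed by analogy with the o-umlaut example in the post.

```python
# Expansion table for a German-telephone-directory style digest.
# Only the o-umlaut case comes from the post; the others are assumed.
UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})


def phonebook_key(name):
    """Lower-case the name and expand umlauts to vowel + 'e'."""
    return name.lower().translate(UMLAUTS)


print(phonebook_key("Schröder"))
print(sorted(["Mohr", "Möller"], key=phonebook_key))
```

Exact-equality tests can keep using the untransformed strings, which is the separation of duties the post argues for.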
guy%gorodish@Sun.COM (Guy Harris) (10/13/87)
> Why not strip the string routines from libc into a shared library,
> that can be dynamically linked at run time? Or even just strcmp()
> and strncmp()?

Because:

1) Some systems will permit you to link a program completely statically, so such programs won't be affected by changes to the shared library.

2) This makes it harder to add support for new collating sequences, as the person adding this support has to build an entirely new shared library.

3) This also makes it harder for a program to *change* the current language in midstream; one could imagine a program (e.g., a multilingual word processor) wanting to do so.

Also, changing "strcmp" would cause problems, because some program might want to support *both* a particular natural language's sorting order and the "native" byte-string sorting order.

The X3J11 committee developing the ANSI C standard has already come up with schemes to support sorting orders other than the native byte-string order; these schemes permit programs to change the sorting order on-the-fly, and to use "strcmp" directly if this is called for. Typical implementations will, one hopes, load the sorting order information from a file, based on the current locale. This file will be tailorable by people without the source code, so that, for example, a vendor could distribute the system worldwide and have people in each country tailor it for their environment (since developers at the home office may not know that country's environment as well as the people in that country).

Guy Harris {ihnp4, decvax, seismo, decwrl, ...}!sun!guy guy@sun.com
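A small Python sketch of the file-driven approach described above, under stated assumptions: the one-line table format, the load_order function, and the "de"/"sv" names are all hypothetical, and in-memory streams stand in for the per-locale files. The German and Swedish placements of a-umlaut follow the description earlier in this thread.

```python
import io


def load_order(stream):
    """Read whitespace-separated letters; position in the file = sort rank."""
    elements = stream.read().split()
    rank = {e: i for i, e in enumerate(elements)}
    unknown = len(elements)  # anything not listed sorts last
    return lambda word: [rank.get(ch, unknown) for ch in word.lower()]


# Stand-ins for per-locale files that a site could edit without source code.
FILES = {
    "de": io.StringIO("a ä b c d e f g h i j k l m n o ö p q r s t u ü v w x y z"),
    "sv": io.StringIO("a b c d e f g h i j k l m n o p q r s t u v w x y z å ä ö"),
}
ORDERS = {name: load_order(f) for name, f in FILES.items()}

words = ["ärm", "zon", "arm"]
# The program can switch orders in midstream just by picking another key.
for locale_name in ("de", "sv"):
    print(locale_name, sorted(words, key=ORDERS[locale_name]))
```

Because the order lives in data rather than in the comparison routine, plain byte-wise comparison remains available alongside it.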
bert@aiva.ed.ac.uk (Bert Hutchings) (10/14/87)
Here's a Scottish contribution to this topic. The British Post Office takes an apparently cavalier, but entirely practical, approach to sorting Scottish surnames beginning with 'Mac' in their telephone directories - every variant spelling of this prefix is sorted as Mac, and the case of the next letter is ignored, but the name is printed as the subscriber prefers to use it. Thus

    MacDonald
    McEnroe
    M`Farquhar
    Macgillicuddy
    Machine Tool Hire Ltd.
    Mcilwraith
    M`indoe

are in collated sequence.

I think that special cases like this, and like the common requirement to elide 'a', 'the' etc, support the negative view that an underlying almost-ready-to-use character collation sequence is so small a component of every desired end product that it isn't really worth the effort.
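The directory rule described above is simple enough to sketch as a sort key in Python. The name mac_key and the regular expression are illustrative assumptions; the rule itself (every prefix spelling sorts as "Mac", case ignored, printed form unchanged) is taken from the post.

```python
import re


def mac_key(name):
    """Sort key only: treat Mac, Mc and M` alike and ignore letter case.

    The name itself is left untouched, so it still prints as the
    subscriber spells it.
    """
    return re.sub(r"^(Mac|Mc|M`)", "mac", name).lower()


names = ["M`indoe", "Machine Tool Hire Ltd.", "McEnroe", "Mcilwraith",
         "MacDonald", "Macgillicuddy", "M`Farquhar"]
for name in sorted(names, key=mac_key):
    print(name)
```

Sorting the shuffled list with this key reproduces the sequence given in the post, including the tool-hire firm that merely happens to start with "Mac".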
bob@its63b.ed.ac.uk (ERCF08 Bob Gray) (10/16/87)
In article <176@aiva.ed.ac.uk> bert@aiva.ed.ac.uk (Bert Hutchings) writes:
>surnames beginning with 'Mac' in their telephone directories - every variant
>spelling of this prefix is sorted as Mac, and the case of the next letter is
>ignored, but the name is printed as the subscriber prefers to use it. Thus
>
> MacDonald
> McEnroe
> M`Farquhar
> Macgillicuddy
> Machine Tool Hire Ltd.
> Mcilwraith
> M`indoe

Just to confuse things further, some women whose family name begins with the Mac (meaning son of) prefix are insisting on being known by the Nic (meaning daughter of) prefix. Yet more special cases to be taken care of.

> ... support the negative view that
>an underlying almost-ready-to-use character collation sequence is so small a
>component of every desired end product that it isn't really worth the effort.

Any collation sequence would also have to be easily re-defined at a user level to reflect local changes in sequence, or changes with time. Any product could have an "international" sequence, but the options should always be there to easily override the defaults.

Bob.