sommar@enea.UUCP (Erland Sommarskog) (08/12/87)
Two things have inspired me to write this article: 1) reading the
(proposed) standards ISO Latin 1-4, and 2) the discussion "What is a
byte".

Reading the standards you discover that there are a whole lot of
letters you never dreamt of, but still they have something in common.
(I'm only talking about Latin letters, but it applies to Greek and
Cyrillic as well.)  With a few exceptions it is the same letters that
reappear; they are just modified in some way.  They have accents,
cedillas, rings, dots, strokes, etc.  Thus, many are combinations of
two or more characters.  The standards are an attempt to satisfy the
requirements of the different languages by assigning each combination
an integer value.

But isn't a character a more complicated data type than a simple
enumeration type?  In some languages the combination may constitute a
new letter ("a" with ring or dots, "o" with dots in Swedish); in
others you can apply accents and other signs without affecting the
sorting (e.g. French, Italian).

I think that the simple representation of characters is entirely due
to the dominating position of the English language in the computer
world.  If computers had been invented in France the problem would
have been solved.  (And if they had been Swedish, Englishmen would
have to accept "v" and "w" being equivalent.)

The conclusion is that a more sophisticated approach must be taken.
However, I must admit that I do not have any bright proposals right
now, but do think of it!
--
Erland Sommarskog
ENEA Data, Stockholm
sommar@enea.UUCP
gordan@maccs.UUCP (Gordan Palameta) (08/13/87)
In article <2171@enea.UUCP> sommar@enea.UUCP (Erland Sommarskog) writes:
>value.  But isn't a character a more complicated data type
>than just a simple enumeration type?  In some languages the
>combination may constitute a new letter ("a" with ring and dots,
>"o" with dots in Swedish), in others you can apply accents and
>other signs without affecting the sorting.  (E.g. French, Italian)
>I think that the simple representation of characters is completely
>due to the dominating position of the English language in the computer
>world.  If computers had been invented in France the problem would
>have been solved.  (And if they had been Swedish, Englishmen would

It gets even more complicated: in Spanish, I believe, ch is considered
a separate letter, between c and d in alphabetical order (likewise
with ll).  It only goes to show that alphabetical order is
language-dependent, and identical strings will sort differently
depending on locale.  The only general solution is to have intelligent
operating system routines to handle sorting.

Despite 7-bit ASCII, which makes possible code such as

    if (c >= 'A' && c <= 'Z')

there is no reason why the numeric representation of a character
should have anything to do with the position of that character in a
collating sequence.
--
UUCP:  ...!mnetor!lsuc!maccs!gordan   BITNET: GP@TANDEM
"Sumasshedshii vsekh stran, soyedinyaites'"        Gordan Palameta
henry@utzoo.UUCP (Henry Spencer) (08/14/87)
> I think that the simple representation of characters is completely
> due to the dominating position of the English language in the computer
> world.  If computers had been invented in France the problem would
> have been solved...

Surely you jest!  If the French had invented computers, the official
rationale for FRSCII (or whatever it would be called :-)) would go to
great lengths to explain why cedillas were part of God's plan but no
civilized human being would ever put pairs of dots above letters!

The best summary I've ever heard was "those who speak English don't
really believe that other languages exist; those who speak French know
that other languages exist, but can't understand why".
--
Support sustained spaceflight: fight | Henry Spencer @ U of Toronto Zoology
the soi-disant "Planetary Society"!  | {allegra,ihnp4,decvax,utai}!utzoo!henry
sommar@enea.UUCP (08/15/87)
In a recent article gordan@maccs.UUCP (Gordan Palameta) writes:
>It gets even more complicated: in Spanish, I believe, ch is considered
>a separate letter, between c and d in alphabetical order (likewise with ll).

Perfectly true.  And Spanish is not unique.  Polish, for instance, has
cz, dz and rz.

>It only goes to show that alphabetical order is language-dependent, and
>identical strings will sort differently depending on locale.  The only
>general solution is to have intelligent operating system routines to
>handle sorting.

It would be preferable to have the sorting as part of the language in
question.  For example in Ada:

    pragma LANGUAGE(French)

The support may be in the OS - or even in the hardware for speed - but
making it part of the language increases portability.

But this doesn't address all the problems I mentioned.  How do you
construct a general character with an arbitrary accent, umlaut or
other diacritical mark?  An 8-bit enumeration isn't sufficient.

>Despite 7-bit ASCII, which makes possible code such as
>    if (c >= 'A' && c <= 'Z')
>there is no reason why the numeric representation of a character should have
>anything to do with the position of that character in a collating sequence.

Right, but almost all programming today depends on it, doesn't it?
It's easier to implement and executes faster.  The character type
should be an abstract one.  The actual implementation (bit size and
all) could vary from compiler to compiler, from OS to OS.  The simple
numeric representation happens to work for English.  For French it
doesn't.  If computers had been invented in France....
--
Erland Sommarskog
ENEA Data, Stockholm
sommar@enea.UUCP
root@hobbes.UUCP (08/15/87)
+---- Erland Sommarskog writes the following in article <2171@enea.UUCP> ----
| In some languages the combination may constitute a new letter
| ("a" with ring and dots, "o" with dots in Swedish), in other you
| can apply accents and other signs without affecting the sorting.
| (E.g. French, Italian)
| The conclusion is that a more sophisticated approach must be taken.
| However, I must admit that I do not have any bright proposals right
| now, yet think of it!
+----

I will be posting (as soon as I finish this note) a routine we use
called stracmp() to the newsgroup comp.sources.misc.  It compares two
strings of 8-bit characters while taking into account the correct
collating sequence and precedence (if any) of accented letters.  The
routine is designed to drop in in place of the common strcmp().

The code has been used on IBM PCs, which support a limited set of
accented characters in their character display ROMs, and also with the
ISO Latin-1 alphabet (this requires more sophisticated display
drivers).  It would be very simple to add the tables for Latin-2
through n if desired.  The code is not dependent on any particular
hardware, but it does assume the C compiler handles "unsigned chars".

I am including some of the comments I made in the header to the code:

Description:
    stracmp() implements a string compare which correctly handles
    accented (non-English) characters which have been encoded using
    8-bit characters.  It uses character lookup tables for doing
    string compares when accented characters are present and/or a
    non-ASCII collating sequence is desired.

Theory:
    The correct way of sorting (or comparing) strings which contain
    accented characters is to first compare the strings with all
    accents stripped.  If the two strings are the same, then and only
    then are the accents used.  This second comparison involves only
    the accents.  You can think of this as comparing the two strings
    with all the letters stripped.
    Also, there are times when the "normal" ASCII collating sequence
    is not appropriate for lexical ordering (i.e. A <AE> B C <CEDILLA> D ...).

Examples (a plain apostrophe stands in for the accent marks here):

    Comparing "Junta" and "Ju'nta'" (the second word has diacritical
    marks over its two vowels):
        first we compare("Junta", "Junta"), which shows them EQUAL;
        then we must compare the accents alone.
    Thus, "Junta" comes before "Ju'nta'" in the lexical ordering of
    the two words.

    Comparing "Ju'nta" and "Ju'nto" (both words have accented 'u's):
        first we compare("Junta", "Junto"); since they are different
        we do not need to do anything more with the accents:
        "Ju'nta" is less than "Ju'nto".
--
John Plocher    uwvax!geowhiz!uwspan!plocher
plocher%uwspan.UUCP@uwvax.CS.WISC.EDU
lambert@mcvax.UUCP (08/15/87)
In article <8410@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
) Surely you jest! If the French had invented computers, the official
) rationale for FRSCII (or whatever it would be called :-)) would go to
) great lengths to explain why cedillas were part of God's plan but no
) civilized human being would ever put pairs of dots above letters!
But the French do! Best known example: <<Noe"l>> (Xmas) with a dieresis on
the <<e>>. Other examples: <<contigue">> (female form of <<contigu>> =
contiguous), <<mai"s>> (corn), <<cycloi"de>>.
This by itself does not prove anything about the relationship between the
French and civilized human beings one way or the other, but it makes it
implausible that they would argue as suggested.
) Support sustained spaceflight: fight | Henry Spencer @ U of Toronto Zoology
There I was believing you made these hilarious signature lines on purpose,
but now I see the ambiguity was probably unintentional. I'll pipe the
output of my fight through you.
--
Lambert Meertens, CWI, Amsterdam; lambert@cwi.nl
gordan@maccs.UUCP (Gordan Palameta) (08/17/87)
In article <47@piring.cwi.nl> lambert@cwi.nl (Lambert Meertens) writes:
>In article <8410@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>) [if the French had invented computers, cedillas would be considered part
>) of "God's plan", but not diereses]
>
>But the French do! [examples follow]

If the French had invented computers, there is little doubt that a
character set to support French would have appeared sooner.  Such a
set might have supported German as well, and other Western European
languages, but it is hardly likely that it would have supported (in
decreasing order of probability) Scandinavian languages, Eastern
European languages, Japanese, Icelandic eth(?) and thorn(?), Cyrillic,
Arabic, Hebrew, etc.

There is no chance that a 16-bit character set would have sprung up,
fully formed; computer memory used to be very, very expensive
(weren't characters six bits, once upon a time?)

In short, it is difficult to see how the situation would have evolved,
several decades after the introduction of FRenchSCII, to a point much
different from what we have today (the ISO 8-bit ASCII sets, JIS
standards for Kanji, etc.)

>) Support sustained spaceflight: fight | Henry Spencer @ U of Toronto Zoology
>
> [humorous comment]

Support two-line signatures, the money you save may be your own.
--
UUCP:  ...!mnetor!lsuc!maccs!gordan   BITNET: GP@TANDEM
"Sumasshedshii vsekh stran, soyedinyaites'"        Gordan Palameta
gordan@maccs.UUCP (Gordan Palameta) (08/17/87)
In article <2183@enea.UUCP> sommar@enea.UUCP (Erland Sommarskog) writes:
>But this doesn't address all the problems I mentioned.  How do you
>construct a general character with an arbitrary accent, umlaut or
>other diacritical mark?  An 8-bit enumeration isn't sufficient.

t umlaut or q cedilla would probably be used very rarely, nor is it
likely that anyone would go to the trouble of designing a font to
accommodate such characters.  Another cost of such generality would be
that accents and other marks would probably have to be indicated by
escape sequences in conjunction with the unmodified letter.  This
would make string-processing software more complicated (and slower),
and text would be longer.

>>Despite 7-bit ASCII, which makes possible code such as
>>    if (c >= 'A' && c <= 'Z')
>>there is no reason why the numeric representation of a character should have
>>anything to do with the position of that character in a collating sequence.
>
>Right, but almost all programming today depends on it, doesn't it?
>It's easier to implement and executes faster.

Not at all, just define a 256-byte lookup table in an include file,
and modify the code to

    if (coll[c] >= FIRST_CHAR && coll[c] <= LAST_CHAR)

with very little loss of efficiency.  To accommodate perverse
languages like Spanish and Polish, which insist on two-letter
combinations for sorting, this won't do however: change the square
brackets to round ones, making coll a function (with some loss of
efficiency, and some inconvenience in coding, since c can't be a
single character any more).

Never mind the French; what if things had turned out differently in
1588 with the Armada, and the Spanish had invented computers?  Or the
Chinese?  Followups to alt.universes.
--
UUCP:  ...!mnetor!lsuc!maccs!gordan   BITNET: GP@TANDEM
"Sumasshedshii vsekh stran, soyedinyaites'"        Gordan Palameta
frisk@askja.UUCP (08/18/87)
In article <176@hobbes.UUCP> root@hobbes.UUCP (John Plocher) writes:
>I will be posting (as soon as I finish this note) a routine which we use
>called stracmp() to the newsgroup comp.sources.misc.  It compares two
>strings of 8 bit characters while taking into account the correct collating
>sequence and precedence (if any) of accented letters.  The routine is
>designed to drop in in place of the common strcmp().

Now - the problem with this is that there is no "correct" collating
sequence for all languages.  One example is the position of character
197 in Latin-1 (A with a circle above).  In some languages it is the
first letter in the alphabet, in others one of the last.

The method described in this article would work partially with
Icelandic, but not quite.  To see why, consider the Icelandic
alphabet:

    A 'A B D (ETH) E 'E F G H I 'I J K L M N O 'O P R S T
    U 'U V X Y 'Y (THORN) (AE) (o with two dots above)

Of these 32 characters, the last two would end up in the wrong places.
What is needed is either

    a strcmp(string1,string2,LANGUAGE) function

or

    a strcmp_LANGUAGE(string1,string2)
--
Fridrik Skulason     Univ. of Iceland, Computing Center
UUCP  ...mcvax!hafro!askja!frisk     BIX  frisk
"This line intentionally left blank"
dan@sics.UUCP (08/18/87)
In article <176@hobbes.UUCP> root@hobbes.UUCP (John Plocher) writes:
>Theory:
>    The correct way of sorting (or comparing) strings which contain
>    accented characters is to first compare the strings with all accents
>    stripped.  If the two strings are the same, then and only then are the
>    accents used.

Sorry, that is not the correct way to sort in Swedish.  The three
letters a (with a ring), a (with two dots) and o (with two dots)
always come at the end of the alphabet, after z.  Traditionally v and
w are also considered the same letter in Swedish.

There is unfortunately no universal way to sort things alphabetically;
each language has its own ways.  This fact has for instance been
incorporated into the Macintosh system, where the sorting depends on
the national version.

    Dan Sahlin    (dan@sics.uucp)
pom@under..ARPA (Peter O. Mikes) (08/19/87)
To: gordan@maccs.UUCP
Subject: Re: Character representation
Newsgroups: comp.std.internat, sci.lang
In-Reply-To: <719@maccs.UUCP>

In article <719@maccs.UUCP> you write:
>Followups to alt.universes.

I am sorry, but according to the latest QM, the multiple universes not
only keep splitting, they also merge.  This happens to be one such
feedback from an Alternative Universe.

Besides, I have a VERY CONSTRUCTIVE (insider's info) FACT on cedillas,
umlauts, haceks, and other such [modifiers], namely: in all languages
I know there are many kinds, but ANY PARTICULAR LETTER either has one
- or it does not.  That means that we need to reserve just 1 bit
(0 .. unmodified, 1 .. modified) to take care of dozens of languages.
So e.g. if the switch (ROM, printwheel, ...) is set to German, a
modified o will put two dots (umlaut) above the o; in Czech the same
bit will put ' above 'aeiou' but will put an inverted ^ over
consonants (since only 'aeiou' are allowed to have ' and only
consonants can have ^), and so it goes.

>>But this doesn't address all the problems I mentioned.  How do you
>>construct a general character with an arbitrary accent, umlaut or
>>other diacritical mark?  An 8-bit enumeration isn't sufficient.

The problem you (somebody) mentioned is hereby addressed.  To disprove
my conjecture, name one language with a Latin-based alphabet and one
letter in that alphabet which admits more than one modifier.

Oh, just BTW - using poor ASCII, which has no modifier bit, I am using
the convention that the modifier is indicated by h (e.g. the word
(modified_s)ot would appear as shot) - which is quite wasteful, as a
whole h is needed to perform the function of one bit.  (I am not quite
sure if all mono-anglo-phones realise that English is actually using
pairs for sounds: English sh, like the perverse Hungarian sz, is
actually one sound (soft s, or s^).)
The difference is mostly that English is ambiguous and arbitrary and
(on the positive side) allows collating based on single characters
(but anybody can accept that, since you get your pairs sz sorted in
the same sequence (almost) always anyway).

In a related posting,
>--- David Phillip Oster  --My Good News: "I'm a perfectionist."
>Arpa: oster@dewey.soe.berkeley.edu --My Bad News: "I don't charge by the hour."
writes:
> There is no reason why we couldn't use a huffman encoding
>scheme: the 14 most common ideograms fit in a 4 bit nybble, the 15th
>pattern is a filler, and the 16th pattern means that the next byte
>encodes the 254 next most common ideograms, the 255 bit pattern
>...............................
>this idea would also work for English.  Assuming that the average
>English word takes 6*8 bits (average length of 5 + terminating space
>* 8 bit ascii) you could cut the disk space required for computer..

and I SAY, there is a reason: I would like to propose a criterion for
(or attribute of) coding of text.  A coding is LOCAL (within n) if
from each 3n bytes I can derive one (middle) letter of the encoded
text.  In this sense, the codings based on pairs (Polish, Spanish, sh
for s^, etc.) are all local (within 2), but coding based on frequency
of words is not (besides being language dependent).  (Please recall
that I consider ideographs to be 'words' made of strokes.)  Coding
based on frequency of characters is local and (if we accept the
above-explained modifier-bit convention) also language independent.

I do believe that since we are discussing CHARACTER sets, we should
leave out codings based on dictionaries (word sets) - they have their
function, but they are much more (application, language, etc.)
dependent than the character sets.  Let's reach some agreement on
letters first.
Yours Dr. pom - a scientist - (quite mad) pom@under.s1.gov
alin@sunybcs.uucp (Alin Sangeap) (08/19/87)
In article <15381@mordor.s1.gov> pom@s1-under.UUCP () writes:
|Besides I have VERY CONSTRUCTIVE
|( insiders info ) FACT on cedillas, umlauts, haceks, and other such ...
| [modifiers] namely : In all languages I know, there are many kinds,
| but ANY PARTICULAR LETTER either has one - or it does not.
| To disprove my conjecture, name one language with Latin-based alphabet
| and one letter in that alphabet, which admits more than one modifier.

Romanian (of the Latin group of languages).  Letter a:

    a (plain)
    a with a breve (a parenthesis facing upwards) above it
    a with a French-style circumflex accent (the upper half of a
        45-degree-tilted square)
--
Alin Sangeap    State U. of N.Y. @ Buffalo C.S.
INTERNET: alin@cs.buffalo.edu    BITNET: alin@sunybcs.bitnet
UUCP: {allegra,ames,boulder,decvax,rocksanne,rutgers,watmath}!sunybcs!alin
NSA: please decode all secret cryptography ciphers; best of wishes, A.
wales@ucla-cs.UUCP (08/19/87)
In article <15381@mordor.s1.gov> pom@s1-under.UUCP () writes:
>To disprove my conjecture, name one language with Latin-based alphabet
>and one letter in that alphabet, which admits more than one modifier.

Good try, really, but there are several counterexamples:

    Czech.      "U" can have an acute accent, or a small circle.
                Also, "E" can have either an acute accent or a "hacek"
                (V-like accent).
    French.     "E" can have an acute, grave, or circumflex accent, or
                a diaeresis (two dots).  "A" and "U" can have either a
                grave or a circumflex accent.  "I" can have either a
                circumflex accent or a diaeresis.
    Hungarian.  "O" and "U" can have a regular acute accent, a regular
                umlaut (two dots), or a "long" umlaut (two acute
                accents).
    Polish.     "Z" can have an acute accent, or a single dot.
    Romanian.   "A" can have a breve ("short" sign, like a small U) or
                a circumflex.
    Swedish.    "A" can have either an umlaut, or a small circle.
    Vietnamese. There are several different kinds of accent marks used
                in this language to indicate tones (syllable pitch
                patterns), and as far as I'm aware, any of these
                accents may occur on any vowel.  (And, yes, modern
                Vietnamese *does* use the Latin alphabet.)

It may or may not be relevant, for purposes of this discussion, to
note that some of the above languages treat the "modified" versions of
their letters as completely distinct letters in their own right.
--
Rich Wales // UCLA Computer Science Department // +1 213-825-5683
3531 Boelter Hall // Los Angeles, California 90024-1596 // USA
wales@CS.UCLA.EDU    ...!(ucbvax,rutgers)!ucla-cs!wales
"Sir, there is a multilegged creature crawling on your shoulder."
rob@pbhye.UUCP (Rob Bernardo) (08/19/87)
In article <15381@mordor.s1.gov> pom@s1-under.UUCP () writes:
+( insider's info ) FACT on cedillas, umlauts, haceks, and other such ...
+ [modifiers] namely : In all languages I know, there are many kinds,
+ but ANY PARTICULAR LETTER either has one - or it does not. That means
+ that we need to reserve just 1 bit ( 0.. unmodified) and (1.. modified).
+ to take care of dozens of languages.
Not so.
French allows ' and ^ over all vowels and ` additionally over "e".
Hungarian allows two dots, ', and '' over "u" and "o".
Spanish allows ' and two dots over "u".
Just to name a few off the top of my head. :-)
--
I'm not a bug, I'm a feature.
Rob Bernardo, San Ramon, CA (415) 823-2417 {pyramid|ihnp4|dual}!ptsfa!rob
sandi@apollo.uucp (Sandra Martin) (08/19/87)
Peter O. Mikes @ S-1 Project, LLNL writes:
> So e.g. if switch ( ROM, printwheel,..) is set to German , modified o will
> put two dots ( umlaut) above o; In Czech the same bit will put ' above
> 'aeiou' but will put inverted ^ over consonants
>
> The problem you (somebody) mentioned is hereby addressed.
> To disprove my conjecture, name one language with Latin-based alphabet
> and one letter in that alphabet, which admits more than one modifier.

French allows an 'e' to take an acute or a grave accent, as well as a
circumflex and an umlaut.  Examples:

    e'cole       (school)
    privile`ge   (privilege)
    e^tre        (to be)
    Noe"l        (Christmas)

The 'a', 'i', and 'u' in French can also take multiple diacriticals.
And in Swedish, the 'a' can take a ring or an umlaut.  I imagine there
are other examples, but these are the ones I could think of off the
top of my head.

Sandra Martin, Apollo Computer
UUCP: ...{mit-erl,mit-eddie,yale,uw-beaver,decvax!wanginst}!apollo!sandi
ARPA: apollo!sandi@eddie.mit.edu
joe@haddock.ISC.COM (Joe Chapman) (08/19/87)
>Besides I have VERY CONSTRUCTIVE
>( insiders info ) FACT on cedillas, umlauts, haceks, and other such ...
> [modifiers] namely : In all languages I know, there are many kinds,
> but ANY PARTICULAR LETTER either has one - or it does not.
> To disprove my conjecture, name one language with Latin-based alphabet
> and one letter in that alphabet, which admits more than one modifier.

I don't even have to get obscure: in French an ``e'' can take one of
three accents (grave, acute, circumflex) or a diaeresis.

> english sh is perverse Hungarian's sz
> is actualy one sound (soft s or s^).

Minor quibble: in Hungarian, sz is pronounced like the s in English
"soup", and s is pronounced as in "shop".
--
Joe Chapman
harvard!ima!joe
scottha@athena.TEK.COM (Scott Hankerson) (08/19/87)
In article <15381@mordor.s1.gov> pom@s1-under.UUCP () writes:
>>>But this doesn't address all the problems I mentioned.  How do you
>>>construct a general character with an arbitrary accent, umlaut or
>>>other diacritical mark?  An 8-bit enumeration isn't sufficient.
>
> The problem you (somebody) mentioned is hereby addressed.
> To disprove my conjecture, name one language with Latin-based alphabet
> and one letter in that alphabet, which admits more than one modifier.

In French, one can have up to four different modifiers over a
particular letter: the letter e can have a grave accent, an acute
accent (the one going the other direction), a circumflex, or an
umlaut.  In addition, I may want to quote from other languages if I
write in French.

> Oh, just BTW - using poor ASCII, which has no modifier bit, I am
> using the convention that modifier is indicated by h ( e.g.
> a word: (modified_s)ot would appear as shot.

Surely this would introduce even more ambiguities.  In German, an h
lengthens the vowels.  Is a vowel followed by an h an umlauted vowel?
An umlauted vowel followed by an h?  Or simply a vowel followed by an
h?

I haven't seen anyone mention an ISO standard yet.  I was under the
impression that there was one.  Am I wrong?  I don't much care for the
alternates that I have seen used by terminal manufacturers in the US:
a keyboard with many of the special symbols replaced with accented
characters.  That may be nice for writing documents, but it must be
intolerable for coding in C or any other programming language which
uses many nonalphabetic symbols.
sommar@enea.UUCP (08/19/87)
In a recent article gordan@maccs.UUCP (Gordan Palameta) writes:
>t umlaut or q cedilla would probably be used very rarely, nor is it likely
>that anyone would go to the trouble of designing a font to accommodate such
>characters.  Another cost of such generality would be that accents and other
>marks would probably have to be indicated by escape sequences in conjunction
>with the unmodified letter.  This would make string-processing software more
>complicated (and slower), and text would be longer.

If you had something like the 8th bit meaning that the following byte
is a modifier, this would increase the length of the text and the
string-processing time only moderately.  This solution does not,
however, by itself address the problem that different languages have
different collating sequences.

>Not at all, just define a 256-byte lookup table in an include file, and
>modify the code to
>    if (coll[c] >= FIRST_CHAR && coll[c] <= LAST_CHAR)
>with very little loss of efficiency.  To accommodate perverse languages like
>Spanish and Polish which insist on two-letter combinations for sorting,

Spanish and Polish aren't any more perverse than English.  Of course I
know about lookup tables.  I have myself written a program that uses a
two-level lookup table for comparing words.  (And the words are
transcribed in three levels.  You don't want the hyphen in a
hyphenated word to be significant.)  But to have that in every single
program that does string comparisons?  No, thank you.  It increases
the complexity and hurts the readability of the code.  It would be
much nicer if "ch1 >= ch2" meant that ch1 comes before or at the same
position as ch2 in the alphabet we have currently chosen.  (It's
unclear what equality means when modified letters are involved.
Probably you will need two kinds of equality.)

>Never mind the French; what if things had turned out differently in 1588
>with the Armada, and the Spanish had invented computers?  Or the Chinese?

I just took the Frenchmen as an example, OK?  No matter who had
invented the computers: if their language had had the dominating
position that English has, that language would have set the standard
for character representation with no other language in mind.  I took
French as an example since it has plenty of modifiers that are not
significant for sorting.

>Followups to alt.universes.

If you find the subject that uninteresting, why did you ever write the
article at all?
--
Erland Sommarskog
ENEA Data, Stockholm
sommar@enea.UUCP
alan@pdn.UUCP (Alan Lovejoy) (08/20/87)
In article <15381@mordor.s1.gov> pom@s1-under.UUCP () writes:
> To disprove my conjecture, name one language with Latin-based alphabet
> and one letter in that alphabet, which admits more than one modifier.

French: "e" can have either a left overstrike, a right overstrike, or
a hat "^".  Admittedly, only the right overstrike changes the
pronunciation in MODERN Parisian French.

However, in Navajo vowels must simultaneously be markable for nasality
by a cedilla, as well as by diacritical marks to indicate other
variations.  All vowels can be nasal/nonnasal and voiced/unvoiced, and
I believe there are exotic languages with even other variations (I'll
have to look that up, though).

Things get even stickier when you consider the problem of multilingual
text, however (e.g., "I said to him, 'Who are you?'  To which he
answered, 'Je ne parle pas anglais.  Parlez-vous francais?'").

--Alan "Bozhe moi! Kommitjet Gosudarstvjenoj Bjezopastnostji
sljedujet...!!!!" Lovejoy
dean@hyper.UUCP (Dean Gahlon) (08/20/87)
in article <15381@mordor.s1.gov>, pom@under..ARPA (Peter O. Mikes) says:
]
] The problem you (somebody) mentioned is hereby addressed.
] To disprove my conjecture, name one language with Latin-based alphabet
] and one letter in that alphabet, which admits more then one modifier.
] [random stuff deleted]
]
] ( I am not quite sure if all mono-anglo-phones realise that english is
] actually using pairs for sounds ( english sh is perverse Hungarian's sz
] is actualy one sound (soft s or s^).
You mentioned it yourself - Hungarian. (O, for instance, has
both long and short umlauts).
sommar@enea.UUCP (Erland Sommarskog) (08/21/87)
This was intended to go by mail, but it came back to me.  (Athena was
an unknown host.)

In article <1583@athena.TEK.COM> you write:
>I haven't seen anyone mention an ISO standard yet.  I was under the
>impression that there was one.  Am I wrong?

You must have missed it.  Tim Lasko from DEC wrote an article on the
status of the eight ISO standards.  That was in comp.std.internat some
weeks ago.

The ISO standards are not sufficient.  I don't know whether you have
read my articles in comp.std.internat where I discussed the need for
another concept for character representation.  I find it quite
inconvenient to find the end of my alphabet somewhere around code 200.

>I don't much care for the alternates
>that I have seen used by terminal manufacturers in the US which is a
>keyboard with many of the special symbols replaced with accented characters.
>That may be nice for writing documents, but it must be intolerable for
>coding in C or any other programming language which uses many nonalphabetic
>symbols.

Having screens with national characters replacing brackets and braces
is no problem.  Many languages that use these characters allow
alternatives.  (E.g. [] can be replaced by (..) in most Pascal
dialects.)  Languages that do not provide alternatives, I simply
refuse to use.
--
Erland Sommarskog
ENEA Data, Stockholm
sommar@enea.UUCP
andersa@kuling.UUCP (Anders Andersson) (08/22/87)
In article <719@maccs.UUCP> gordan@maccs.UUCP (Gordan Palameta) writes:
>t umlaut or q cedilla would probably be used very rarely, nor is it likely
>that anyone would go to the trouble of designing a font to accommodate such
>characters. Another cost of such generality would be that accents and other
>marks would probably have to be indicated by escape sequences in conjunction
>with the unmodified letter. This would make string-processing software more
>complicated (and slower), and text would be longer.
The existence of something like four different 94-character "Latin" sets suggests that 8 bits wouldn't suffice anyway, although I haven't counted the exact number of existing glyph combinations. I believe Welsh includes some strange accented consonants, but I don't remember which (maybe w^). If you also take Vietnamese into account, which allows several accents used at the same time, you'd definitely overflow the table.

TeX provides for arbitrary combinations of accents, and I think this approach is quite simple (although I don't suggest TeX as the encoding scheme to be used for files in general). I don't think somebody has to design a font manually for each and every combination, as the acute accent over e looks pretty much the same as the acute accent over o, and the combination could be done automatically at display time. Some characters will need special treatment though, like the capital Swedish A with circle above (they should usually touch each other) and the Polish bar-crossed L. The amount of programming and CPU power to be used for this depends on what quality and resolution of display you require.

If this general approach turns out to be the most practical one technically, some people may of course go hog wild putting circles under X and cedillas over 7, but there is as little point in stopping them as in preventing people from writing "fiYw#s" with a proportional font. Just apply the general accent attachment rule and they'll be quiet...
> if (coll[c] >= FIRST_CHAR && coll[c] <= LAST_CHAR)
>with very little loss of efficiency. To accommodate perverse languages like
>Spanish and Polish which insist on two-letter combinations for sorting,
What about the "Mac" or "Mc" in English (Scottish?) proper names? I agree this example is a little extreme in comparison to the Spanish "graphemes" ch, ll and rr (?), as well as the Czech ch. Maybe the English don't mind seeing "McDonald" sorted after "Machiavelli", or whatever the rule is/was -- has it been abolished by now?

There are different kinds of sorting even within one language, depending on the context. Donald E. Knuth provides a wonderful collection of rules for bibliographic use in the beginning of his volume on Sorting and Searching, such as ignoring articles and spelling out numbers. These rules don't apply to filenames in a UNIX directory, I think!
-- 
Anders Andersson, Dept. of Computer Systems, Uppsala University, Sweden
Phone: +46 18 183170
UUCP: andersa@kuling.UUCP (...!{seismo,mcvax}!enea!kuling!andersa)
andersa@kuling.UUCP (Anders Andersson) (08/23/87)
In article <2201@enea.UUCP> sommar@enea.UUCP (Erland Sommarskog) writes:
>> if (coll[c] >= FIRST_CHAR && coll[c] <= LAST_CHAR)
> Of course I know about look-up tables. I have myself written a programme
>that uses a two-level look-up table for comparing words. (And the words
>are transcribed in three levels. You don't want the hyphen in a hyphenated
>word to be significant.) But to have that in every single programme that
>does string comparisons. No, thank you. It does increase the complexity
>and decrease the readability of the code.
Nobody coding an application should of course ever have to implement those string comparison routines explicitly over and over again, but should rather refer to generic library routines, like isalpha(c). Several libraries can be provided for different kinds of sorting (if you want "RFC666.TXT" after "RFC-INDEX.TXT" but "RFC888.TXT" before because of the way numbers are spelled out, that's up to your choice). If the string comparison library is too big to be linked into your 42 executables, then provide it as a sharable image, or (in an emergency) put it in your favourite kernel...
-- 
Anders Andersson, Dept. of Computer Systems, Uppsala University, Sweden
Phone: +46 18 183170
UUCP: andersa@kuling.UUCP (...!{seismo,mcvax}!enea!kuling!andersa)
henry@utzoo.UUCP (Henry Spencer) (08/23/87)
> I haven't seen anyone mention an ISO standard yet. I was under the impression
> that there was one. Am I wrong? I don't much care for the alternates
> that I have seen used by terminal manufacturers in the US which is a
> keyboard with many of the special symbols replaced with accented characters.
Unfortunately, this *is* the (old) ISO standard. Seven or eight of the special symbols in ASCII are in positions which the ISO 7-bit standard designates as "reserved for national alphabets", or words to that effect. ASCII, of course, doesn't need any extra national-alphabet symbols, so it filled those positions with neat but ASCII-specific things. The mess that results from this was a major motivation for the new ISO Latin standard, which is an 8-bit character set that includes all of ASCII plus some extra goodies plus pretty well everything needed to write the Latin-derived languages.

ISO Latin is unquestionably the wave of the future -- it will help a lot and won't hurt much. It WILL hurt a little (for example, there aren't many Unix programs that use the top bit of char for something else, but those few are exactly the programs that one least wants to modify: the editors and the shells!), but it won't be a tenth as painful as the more drastic changes needed to do seriously non-Latin alphabets like Chinese and Arabic.

My own personal view is that ISO Latin is a Good Thing, I am planning my software for it, and everybody else should too. The various proposals for dealing with the non-Latin alphabets, on the other hand, all seem to me to have rather higher price tags, and I take a "wait and see" attitude toward them.
-- 
Apollo was the doorway to the stars. | Henry Spencer @ U of Toronto Zoology
Next time, we should open it.        | {allegra,ihnp4,decvax,utai}!utzoo!henry
gordan@maccs.UUCP (Gordan Palameta) (08/25/87)
In article <8462@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>
>My own personal view is that ISO Latin is a Good Thing, I am planning my
>software for it, and everybody else should too.
Yay. Hear that, programmers? When you write the Next Great Program (that slices, dices, and upholsters furniture), DON'T TOUCH the eighth bit! This means YOU!
> The various proposals
>for dealing with the non-Latin alphabets, on the other hand, all seem to
>me to have rather higher price tags, and I take a "wait and see" attitude
>toward them.
Ummm, Arabic, Hebrew, Greek, and Cyrillic are or will shortly be taken care of by the same standardization process that produced ISO Latin-1. Each uses a different upper half of the character set. I think there are even standard escape sequences suggested for switching between the different ISO character sets, for terminals capable of displaying more than one such set.

On a different note, the first 32 positions of the upper half of the character set are supposed to be reserved for a bunch of new non-printing characters. On page 26 of the August 1985 BYTE, a preliminary version of ISO Latin 1 is listed, and some of these "control" characters have names, e.g. 08/04 = IND, 08/05 = NEL, etc. It is implied in the accompanying letter that these are intended for word-processing commands. Have the uses of these been standardized? If so, it would certainly seem worth publicizing and discussing here.
henry@utzoo.UUCP (Henry Spencer) (08/25/87)
> Ummm, Arabic, Hebrew, Greek, and Cyrillic are or will shortly be taken care of
> by the same standardization process that produced ISO Latin-1. Each uses a
> different upper half of the character set...
Unfortunately, this brings us back to the old problem that the meaning of a byte is context-dependent. There were alternate character sets for the Latin languages before, and standard escape sequences for switching; much good it did us. Anything with mode-switching is an order of magnitude harder to handle intelligently than a modeless code like ASCII or ISO Latin. Don't forget the right-to-left problems in Arabic and Hebrew, for that matter.

I don't know what the best answer is, and am not convinced that anyone else does either. Hence "wait and see". My sympathy goes out to the people who have compelling commercial reasons to do something about these issues now; it can't be much fun.
-- 
"There's a lot more to do in space   | Henry Spencer @ U of Toronto Zoology
than sending people to Mars." --Bova | {allegra,ihnp4,decvax,utai}!utzoo!henry
daveb@geac.UUCP (Brown) (08/25/87)
In article <737@maccs.UUCP> gordan@maccs.UUCP (Gordan Palameta) writes:
>On a different note, the first 32 positions of the upper half of the character
>set are supposed to be reserved for a bunch of new non-printing characters.
>Have the uses of these been standardized? If so, it would certainly seem
>worth publicizing and discussing here.
I cannot find the names in my existing ISO documentation; could some kind person please post a chart of these?
--dave (diluting my ignorance) collier-brown
ps: I have "ISO 2022 Information processing -- ISO 7-bit and 8-bit coded character sets -- Code extension techniques", Second edition - 1982-12-15, but it doesn't name the characters, just says they're there.
-- 
David Collier-Brown.                  {mnetor|yetti|utgpu}!geac!daveb
Geac Computers International Inc.,    | Computer Science loses its
350 Steelcase Road, Markham, Ontario, | memory (if not its mind)
CANADA, L3R 1B3 (416) 475-0525 x3279  | every 6 months.
wcs@ho95e.ATT.COM (Bill.Stewart) (08/29/87)
In article <718@maccs.UUCP> gordan@maccs.UUCP (Gordan Palameta) writes:
:>In article <8410@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
:>) [if the French had invented computers, ........
:If the French had invented computers, there is little doubt that a character
:set to support French would have appeared sooner. .....
:There is no chance that a 16-bit character set would have sprung up, fully
:formed -- computer memory used to be very, very expensive (weren't characters
:six bits, once upon a time?)
It would be variable length, from 5 to 14 bits ... but the last three or four aren't pronounced :-).

More seriously, while a given numeric character representation doesn't correspond identically to the collating sequence (viz. English [Aa]<[Bb]... vs ASCII or EBCDIC), one can build a table listing character representations in collating order, and use it for sorting rather than building the sequence into the language, as was suggested with #pragma language(Franglais). While one might use the representation directly when the collating sequence doesn't really matter (e.g. building unique lists), one can still build library functions to compare words.
-- 
# Thanks;
# Bill Stewart, AT&T Bell Labs 2G218, Holmdel NJ 1-201-949-0705 ihnp4!ho95c!wcs
sommar@enea.UUCP (09/06/87)
It is now some weeks ago since I wrote an article where I asked for a change of paradigm for character representation. It was followed by a discussion of sorting and of which languages use which accents. In the last days some new ideas have come up, so I'd like to comment on them. In a separate article I am presenting a proposal of my own for a character standard.

Basically we have seen two approaches to the problem: Alan Lovejoy's idea of a character palette, and ISO 6937.

The palette first. The idea has its points, but I feel it is over-worked. Do you really need codes for all possible human sounds? Computers today transmit written language, not spoken. But with speech synthesis advancing, we may need this standard in some decades. Alan hasn't spoken of sorting and character comparison. Just comparing the (arbitrary) numeric codes doesn't seem meaningful. Have you thought of introducing language dependency here?

Then ISO 6937. Until Bruce Sherwood wrote his article I hadn't heard of this standard. Obviously this standard does a lot of what I requested. By introducing mute modifiers a lot more letters can be handled. But apparently it was ahead of its time. Fridrik Skulason writes:
"6937 may be better than 8859 for some purposes (communication that is), but as a standard character set for terminals it is useless. The reason ... Simple. Most existing software packages assume that (1 char in text = 1 char on screen)."
That is giving up, I'd say. Yes, ISO 6937 would require many existing programmes to be rewritten. It would also require progress in hardware to handle the mute characters properly. But can't we do this? If Fridrik is right, we are doomed to live with the ASCII/8859 stone-age approach forever. I think he is wrong. But of course you have to work harder for a "progressive" standard to gain popularity than for a defensive one like 8859. What you need is some leading manufacturer to start using it, or an important customer to require it.
-- 
Erland Sommarskog
ENEA Data, Stockholm
sommar@enea.UUCP