pom@under..ARPA (Peter O. Mikes) (08/19/87)
To: gordan@maccs.UUCP
Subject: Re: Character representation
Newsgroups: comp.std.internat, sci.lang
In-Reply-To: <719@maccs.UUCP>

In article <719@maccs.UUCP> you write:
>Followups to alt.universes.

I am sorry, but according to the latest QM, the multiple universes not
only keep splitting, they also merge. This happens to be one such
feedback from an Alternative Universe.

Besides, I have a VERY CONSTRUCTIVE (insider's info) FACT on cedillas,
umlauts, haceks, and other such ... [modifiers], namely: in all
languages I know, there are many kinds, but ANY PARTICULAR LETTER either
has one or it does not. That means that we need to reserve just 1 bit
(0 = unmodified, 1 = modified) to take care of dozens of languages.
So, e.g., if a switch (ROM, printwheel, ...) is set to German, modified o
will put two dots (umlaut) above o; in Czech the same bit will put '
above 'aeiou' but will put an inverted ^ over consonants (since only
'aeiou' are allowed to have ' and only consonants can have ^), and so
it goes.

>>But this doesn't
>>address all problems I mentioned. How to construct a general character
>>with an arbitrary accent, umlaut or other diacritic mark? An 8-bit
>>enumerate isn't sufficient.

The problem you (somebody) mentioned is hereby addressed.
To disprove my conjecture, name one language with a Latin-based alphabet
and one letter in that alphabet which admits more than one modifier.

Oh, just BTW - using poor ASCII, which has no modifier bit, I am using
the convention that the modifier is indicated by h (e.g. the word
(modified_s)ot would appear as shot), which is quite wasteful, as a
whole h is needed to perform the function of one bit.

(I am not quite sure if all mono-anglo-phones realise that English is
actually using pairs for sounds (English sh is perverse Hungarian's sz
is actually one sound (soft s or s^)). The difference is mostly in that
English is ambiguous and arbitrary and (on the positive side) makes
collating based on singles (but anybody can accept that, since you get
your pairs, sz, sorted in the same sequence (almost) always anyway).)

In a related posting
>--- David Phillip Oster            --My Good News: "I'm a perfectionist."
>Arpa: oster@dewey.soe.berkeley.edu --My Bad News: "I don't charge by the hour."
WRITES
> There is no reason why we couldn't use a huffman encoding
>scheme: the 14 most common ideograms fit in a 4 bit nybble, the 15th
>pattern is a filler, and the 16th pattern means that the next byte
>encodes the 254 next most common ideograms, the 255 bit pattern
...............................
>this idea would also work for English. Assuming that the average
>English word takes 6*8 bits (average length of 5 + terminating space
>* 8 bit ascii) you could cut the disk space required for computer..

and I SAY, there is a reason: I would like to propose a criterion for
(or attribute of) the coding of text. A coding is LOCAL (within n) if
from each 3n bytes I can derive one (middle) letter of the encoded text.
In this sense, the codings based on pairs (Polish, Spanish, sh for s^,
etc.) are all local (within 2), but coding based on frequency of words
is not (besides being language dependent). (Please recall that I
consider ideographs to be 'words' made of strokes.) The coding based on
frequency of characters is local and (if we accept the modifier-bit
convention explained above) also language independent.

I do believe that since we are discussing CHARACTER sets, we should
leave out the codings based on dictionaries (word sets) - they have
their function, but are much more (application, language, etc.)
dependent than the character sets. Let's reach some agreement on
letters first.
Yours Dr. pom - a scientist - (quite mad) pom@under.s1.gov
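
To make the one-bit proposal in the post above concrete, here is a minimal Python sketch. It assumes the top bit of an 8-bit code is the "modified" flag and that a per-language switch decides which mark that flag selects. The names `mark_for` and `decode`, the table contents, and the use of Unicode combining marks in place of a ROM or printwheel are all assumptions made for illustration, not anything specified in the post.

```python
import unicodedata

UMLAUT = "\u0308"  # combining diaeresis (two dots)
ACUTE = "\u0301"   # combining acute accent (')
CARON = "\u030c"   # combining caron (inverted ^, the hacek)


def mark_for(language, letter):
    """Return the mark the modifier bit selects, or None if there is none.

    The rules below are hypothetical tables following the post's examples.
    """
    base = letter.lower()
    if language == "german":
        return UMLAUT if base in "aou" else None
    if language == "czech":
        if not base.isalpha():
            return None
        return ACUTE if base in "aeiou" else CARON
    raise ValueError("no table for language: " + language)


def decode(data, language):
    """Decode bytes where bit 7 means 'modified' and bits 0-6 are ASCII."""
    out = []
    for byte in data:
        letter = chr(byte & 0x7F)
        if byte & 0x80:
            mark = mark_for(language, letter)
            if mark is None:
                raise ValueError("letter %r takes no modifier" % letter)
            # Compose base letter and mark into one character where possible.
            letter = unicodedata.normalize("NFC", letter + mark)
        out.append(letter)
    return "".join(out)
```

Under these assumptions, the same byte (`o` with the top bit set) decodes as o-umlaut with the German table and as o-acute with the Czech table. The sketch also shows the limit of the scheme: with one bit per letter there is no way to tell a reader which of two marks a single letter should carry within one language.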
rob@pbhye.UUCP (Rob Bernardo) (08/19/87)
In article <15381@mordor.s1.gov> pom@s1-under.UUCP () writes:
+( insiders info ) FACT on cedillas, umlauts, haceks, and other such ...
+ [modifiers] namely : In all languages I know, there are many kinds,
+ but ANY PARTICULAR LETTER either has one - or it does not. That means
+ that we need to reserve just 1 bit ( 0.. unmodified) and (1.. modified).
+ to take care of dozens of languages.
Not so.
French allows ' and ^ over all vowels and ` additionally over "e".
Hungarian allows two dots, ', and '' over "u" and "o".
Spanish allows ' and two dots over "u".
Just to name a few off the top of my head. :-)
--
I'm not a bug, I'm a feature.
Rob Bernardo, San Ramon, CA (415) 823-2417 {pyramid|ihnp4|dual}!ptsfa!rob
sandi@apollo.uucp (Sandra Martin) (08/19/87)
Peter O. Mikes @ S-1 Project, LLNL writes:
> So e.g. if switch ( ROM, printwheel,..) is set to German , modified o will
> put two dots ( umlaut) above o; In Czech the same bit will put ' above
> 'aeiou' but will put inverted ^ over consonants ( since only 'aeiou'
> are allowed to have ' and only consonants can have ^, and so it goes.
>
>>But this doesn't
>>address all problems I mentioned. How to construct a general character
>>with an arbitrary accent, umlaut or other diacritic mark? An 8-bit
>>enumerate isn't sufficient.
>
> The problem you (somebody) mentioned is hereby addressed.
> To disprove my conjecture, name one language with Latin-based alphabet
> and one letter in that alphabet, which admits more than one modifier.

French allows an 'e' to take an acute or grave accent, as well as a
circumflex and an umlaut. Examples:

    e'cole (school)
    privile`ge (privilege)
    e^tre (to be)
    Noe"l (Christmas)

The 'a', 'i', and 'u' in French also can take multiple diacriticals.
And in Swedish, the 'a' can take a ring or an umlaut.

I imagine there are other examples, but these are the ones I could
think of off the top of my head.

Sandra Martin, Apollo Computer
UUCP: ...{mit-erl,mit-eddie,yale,uw-beaver,decvax!wanginst}!apollo!sandi
ARPA: apollo!sandi@eddie.mit.edu
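
The counterexample in the post above can be checked mechanically. The following Python sketch decomposes each word into base letters and combining marks and collects the distinct marks seen on each base letter. The function name `marks_per_letter` is hypothetical, and Unicode decomposition via the standard `unicodedata` module is an assumption of this illustration; nothing like it was available to the posters.

```python
import unicodedata
from collections import defaultdict


def marks_per_letter(words):
    """Map each base letter to the set of combining marks seen on it."""
    marks = defaultdict(set)
    for word in words:
        base = None
        # NFD splits an accented letter into base letter + combining mark(s).
        for ch in unicodedata.normalize("NFD", word.lower()):
            if unicodedata.combining(ch):
                if base is not None:
                    marks[base].add(unicodedata.name(ch))
            else:
                base = ch
    return dict(marks)


# The four French examples from the post, written with real accents.
FRENCH_EXAMPLES = ["école", "privilège", "être", "Noël"]
```

Calling `marks_per_letter(FRENCH_EXAMPLES)` yields four distinct marks for "e" (acute, grave, circumflex, diaeresis), so a single modified/unmodified bit per letter cannot distinguish them.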
joe@haddock.ISC.COM (Joe Chapman) (08/19/87)
>Besides I have VERY CONSTRUCTIVE
>( insiders info ) FACT on cedillas, umlauts, haceks, and other such ...
> [modifiers] namely : In all languages I know, there are many kinds,
> but ANY PARTICULAR LETTER either has one - or it does not.

> To disprove my conjecture, name one language with Latin-based alphabet
> and one letter in that alphabet, which admits more than one modifier.

I don't even have to get obscure: in French an ``e'' can take one of
three accents (grave, acute, circumflex) or a diaeresis.

> english sh is perverse Hungarian's sz
> is actually one sound (soft s or s^).

Minor quibble: in Hungarian, sz is pronounced like the s in English
"soup", and s is pronounced as in "shop".
--
Joe Chapman
harvard!ima!joe
scottha@athena.TEK.COM (Scott Hankerson) (08/19/87)
In article <15381@mordor.s1.gov> pom@s1-under.UUCP () writes:
>>>But this doesn't
>>>address all problems I mentioned. How to construct a general character
>>>with an arbitrary accent, umlaut or other diacritic mark? An 8-bit
>>>enumerate isn't sufficient.
>
> The problem you (somebody) mentioned is hereby addressed.
> To disprove my conjecture, name one language with Latin-based alphabet
> and one letter in that alphabet, which admits more than one modifier.

In French, one can have up to four different modifiers over a particular
letter (the letter e can have a grave accent, one going the other
direction (I never can remember what they're called in English (')), a
circumflex, or an umlaut). In addition, I may want to quote from other
languages if I write in French.

> Oh, just BTW - using poor ASCII, which has no modifier bit, I am
> using the convention that modifier is indicated by h ( e.g.
> a word: (modified_s)ot would appear as shot. (which is quite wasteful
> as whole h is needed to perform function of one bit).

Surely this would introduce even more ambiguities. In German, an h
lengthens the vowels. Is a vowel followed by an h an umlauted vowel?
An umlauted vowel followed by an h? Or simply a vowel followed by an h?

I haven't seen anyone mention an ISO standard yet. I was under the
impression that there was one. Am I wrong? I don't much care for the
alternates that I have seen used by terminal manufacturers in the US
which is a keyboard with many of the special symbols replaced with
accented characters. That may be nice for writing documents, but it
must be intolerable for coding in C or any other programming language
which uses many nonalphabetic symbols.
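
The ambiguity raised above can be shown with a short Python sketch of the "h as modifier" convention. It assumes every diacritic is simply replaced by a following `h`; the function name `encode_h` is hypothetical, and Unicode decomposition stands in for whatever input method would mark a letter as modified.

```python
import unicodedata


def encode_h(text):
    """Write each modified letter as base letter + 'h' (assumed convention)."""
    out = []
    # NFD splits an accented letter into its base letter and combining mark.
    for ch in unicodedata.normalize("NFD", text):
        out.append("h" if unicodedata.combining(ch) else ch)
    return "".join(out)
```

Under this assumption, a word spelled with a modified s followed by "ot" and the plain English word "shot" both encode to the same ASCII string, so the encoding cannot be reversed without outside knowledge of the language. A German word with an umlauted vowel followed by a real h would come out with a doubled "hh", which is the kind of case the post asks about.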
alan@pdn.UUCP (Alan Lovejoy) (08/20/87)
In article <15381@mordor.s1.gov> pom@s1-under.UUCP () writes:
> To disprove my conjecture, name one language with Latin-based alphabet
> and one letter in that alphabet, which admits more than one modifier.

French: "e" can have either a left overstrike, a right overstrike or a
hat "^". Admittedly, only the right overstrike changes the pronunciation
in MODERN Parisian French.

However, in Navajo vowels must simultaneously be markable for nasality
by a cedilla, as well as by diacritical marks to indicate other
variations. All vowels can be nasal/nonnasal and voiced/unvoiced, and I
believe there are exotic languages with even other variations (I'll
have to look that up, though).

Things get even stickier when you consider the problem of multilingual
text, however (e.g., "I said to him, 'Who are you'? To which he
answered, 'Je ne parle pas anglais. Parlez-vous francais'?").

--Alan "Bozhe moi! Kommitjet Gosudarstvjenoj Bjezopastnostji sljedujet...!!!!" Lovejoy
dean@hyper.UUCP (Dean Gahlon) (08/20/87)
in article <15381@mordor.s1.gov], pom@under..ARPA (Peter O. Mikes) says:
]
] The problem you (somebody) mentioned is hereby addressed.
] To disprove my conjecture, name one language with Latin-based alphabet
] and one letter in that alphabet, which admits more than one modifier.
] [random stuff deleted]
]
] ( I am not quite sure if all mono-anglo-phones realise that english is
] actually using pairs for sounds ( english sh is perverse Hungarian's sz
] is actually one sound (soft s or s^).
You mentioned it yourself - Hungarian. (O, for instance, has
both long and short umlauts).
sommar@enea.UUCP (Erland Sommarskog) (08/21/87)
This was intended to go by mail, but it came back to me. (Athena was
an unknown host.)

In article <1583@athena.TEK.COM> you write:
>I haven't seen anyone mention an ISO standard yet. I was under the impression
>that there was one. Am I wrong?

You must have missed it. Tim Lasko from DEC wrote an article on the
status of the eight ISO standards. That was in comp.std.internat some
weeks ago.

The ISO standards are not sufficient. I don't know whether you have
read my articles in comp.std.internat where I discussed the need for
another concept for character representation. I find it quite
inconvenient to find the end of my alphabet somewhere at code 200.

> I don't much care for the alternates
>that I have seen used by terminal manufacturers in the US which is a
>keyboard with many of the special symbols replaced with accented characters.
>That may be nice for writing documents, but it must be intolerable for
>coding in C or any other programming language which uses many nonalphabetic
>symbols.

Having screens with national characters replacing brackets and braces
is no problem. Many languages that use these characters allow
alternatives. (E.g. [] can be replaced by (..) in most Pascal dialects.)
Languages that do not provide alternatives, I simply refuse to use.
--
Erland Sommarskog
ENEA Data, Stockholm
sommar@enea.UUCP
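
One practical consequence of finding the end of the alphabet "somewhere at code 200" is that sorting by character code no longer matches alphabetical order. The Python sketch below assumes the Swedish alphabet runs a to z followed by a-ring, a-umlaut, o-umlaut, and compares plain code-order sorting with sorting by position in that alphabet. The names `SWEDISH`, `swedish_key` and `sort_swedish` are hypothetical, and the sketch only handles words made of those 29 letters.

```python
# Assumed Swedish alphabet order: a-z, then a-ring, a-umlaut, o-umlaut.
SWEDISH = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"


def swedish_key(word):
    """Sort key: position of each letter in the assumed alphabet.

    Raises ValueError for characters outside the 29 letters above.
    """
    return [SWEDISH.index(ch) for ch in word.lower()]


def sort_swedish(words):
    """Sort words alphabetically according to SWEDISH."""
    return sorted(words, key=swedish_key)
```

With plain `sorted`, a word starting with a-umlaut lands before one starting with a-ring, because its character code is lower; `sort_swedish` puts them in the assumed alphabetical order instead.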
henry@utzoo.UUCP (Henry Spencer) (08/23/87)
> I haven't seen anyone mention an ISO standard yet. I was under the impression
> that there was one. Am I wrong? I don't much care for the alternates
> that I have seen used by terminal manufacturers in the US which is a
> keyboard with many of the special symbols replaced with accented characters.

Unfortunately, this *is* the (old) ISO standard. Seven or eight of the
special symbols in ASCII are in positions which the ISO 7-bit standard
designates as "reserved for national alphabets", or words to that
effect. ASCII, of course, doesn't need any extra national-alphabet
symbols, so it filled those positions with neat but ASCII-specific
things.

The mess that results from this was a major motivation for the new ISO
Latin standard, which is an 8-bit character set that includes all of
ASCII plus some extra goodies plus pretty well everything needed to
write the Latin-derived languages. ISO Latin is unquestionably the wave
of the future -- it will help a lot and won't hurt much. It WILL hurt a
little (for example, there aren't many Unix programs that use the top
bit of char for something else, but those few are exactly the programs
that one least wants to modify: the editors and the shells!), but it
won't be a tenth as painful as the more drastic changes needed to do
seriously non-Latin alphabets like Chinese and Arabic.

My own personal view is that ISO Latin is a Good Thing, I am planning
my software for it, and everybody else should too. The various
proposals for dealing with the non-Latin alphabets, on the other hand,
all seem to me to have rather higher price tags, and I take a "wait and
see" attitude toward them.
--
Apollo was the doorway to the stars. | Henry Spencer @ U of Toronto Zoology
Next time, we should open it.        | {allegra,ihnp4,decvax,utai}!utzoo!henry
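
The point about programs that use the top bit of a char can be illustrated with a short Python sketch. It assumes that "ISO Latin" here corresponds to what Python's codecs call `latin-1` (ISO 8859-1); the helper name `strip_top_bit` is hypothetical and stands in for any program that masks characters to 7 bits.

```python
def strip_top_bit(data):
    """Mask every byte to 7 bits, as a program reusing bit 7 might do."""
    return bytes(b & 0x7F for b in data)


def roundtrip_through_7bit(text):
    """Encode as Latin-1, lose the top bit, and decode what is left."""
    encoded = text.encode("latin-1")     # one byte per character
    damaged = strip_top_bit(encoded)     # accented letters lose bit 7
    return damaged.decode("ascii")       # every byte is now below 0x80
```

Plain ASCII text survives the round trip unchanged, since Latin-1 keeps all ASCII codes in place. An accented letter does not: e-diaeresis is byte 0xEB in Latin-1, and masking it to 7 bits gives 0x6B, the letter "k", so the French "Noel" with a diaeresis comes back as "Nokl".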