[sci.lang] Character representation

pom@under..ARPA (Peter O. Mikes) (08/19/87)

To: gordan@maccs.UUCP
Subject: Re: Character representation
Newsgroups: comp.std.internat, sci.lang
In-Reply-To: <719@maccs.UUCP>

In article <719@maccs.UUCP> you write:
>Followups to alt.universes.

 I am sorry, but according to the latest QM, the multiple universes not
 only keep splitting, they also merge. This happens to be one such
 feedback from an Alternative Universe. Besides, I have a VERY CONSTRUCTIVE
 (insider's info) FACT on cedillas, umlauts, haceks, and other such
 [modifiers], namely: in all the languages I know, there are many kinds,
 but ANY PARTICULAR LETTER either has one or it does not. That means
 that we need to reserve just 1 bit (0 = unmodified, 1 = modified)
 to take care of dozens of languages.
 So, e.g., if the switch (ROM, printwheel, ...) is set to German, a modified o
 will put two dots (umlaut) above the o. In Czech the same bit will put ' above
 'aeiou' but will put an inverted ^ over consonants (since only 'aeiou'
 are allowed to have ' and only consonants can have ^), and so it goes.
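
 A minimal sketch of the one-bit scheme just described, to make the
 proposal concrete. The byte layout (low 7 bits = ASCII letter, top bit =
 modifier flag), the function names, and the ASCII stand-ins for the marks
 are all assumptions made for illustration; the post specifies none of them.

```python
# Hypothetical rendering of the proposed one-bit modifier scheme.
# Assumed layout: low 7 bits hold the ASCII letter, the top bit is the flag.

VOWELS = set("aeiou")


def render(letter, modified, language):
    """Return the letter followed by an ASCII stand-in for its mark."""
    if not modified:
        return letter
    if language == "german":
        return letter + '"'  # two dots (umlaut)
    if language == "czech":
        # ' over vowels; the hacek (inverted ^) over consonants is
        # written ^ here, as in this thread's "s^".
        return letter + ("'" if letter in VOWELS else "^")
    raise ValueError(f"no modifier table for {language!r}")


def decode(codes, language):
    """Decode a sequence of 8-bit codes under the given language switch."""
    return "".join(
        render(chr(c & 0x7F), bool(c & 0x80), language) for c in codes
    )


word = [ord("s") | 0x80, ord("o"), ord("t")]
print(decode(word, "czech"))  # prints s^ot
```

 The same code sequence renders differently under each language switch,
 which is the whole point of the proposal: the bit says only "modified",
 and the switch decides which mark that means.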

>>But this doesn't
>>address all problems I mentioned. How to construct a general character
>>with an arbitrary accent, umlaut or other diacritic mark? An 8-bit
>>enumeration isn't sufficient.

 The problem you (somebody) mentioned is hereby addressed.
 To disprove my conjecture, name one language with a Latin-based alphabet
 and one letter in that alphabet which admits more than one modifier.

   Oh, just BTW: using poor ASCII, which has no modifier bit, I am
   using the convention that the modifier is indicated by h (e.g.
   the word (modified_s)ot would appear as shot), which is quite wasteful,
   as a whole h is needed to perform the function of one bit.
 
 (I am not quite sure if all mono-anglo-phones realise that English is
 actually using pairs for sounds: English sh, like perverse Hungarian's sz,
 is actually one sound (soft s, or s^). The difference is mostly that
 English is ambiguous and arbitrary and, on the positive side, makes
 collating based on single letters. But anybody can accept that, since you
 get your pairs (sz) sorted in the same sequence (almost) always anyway.)

 In a related posting 
>--- David Phillip Oster            --My Good News: "I'm a perfectionist."
>Arpa: oster@dewey.soe.berkeley.edu --My Bad News: "I don't charge by the hour."
                  WRITES
> There is no reason why we couldn't use a huffman encoding
>scheme: the 14 most common ideograms fit in a 4 bit nybble, the 15th
>pattern is a filler, and the 16th pattern means that the next byte
>encodes the 254 next most common ideograms, the 255 bit pattern
    ...............................
>this idea would also work for English. Assuming that the average
>English word takes 6*8 bits (average length of 5 + terminating space
>* 8 bit ascii) you could cut the disk space required for computer..
   and I SAY, there is a reason: 	
  I would like to propose a criterion for (or attribute of) the coding of
 text. A coding is LOCAL (within n) if from each 3n bytes I can derive
 the one (middle) letter of the encoded text. In this sense, the codings
 based on pairs (Polish, Spanish, sh for s^, etc.) are all local (within 2),
 but coding based on the frequency of words is not (besides being language
 dependent). (Please recall that I consider ideographs to be 'words' made
 of strokes.)
 The coding based on the frequency of characters is local and (if we accept
 the modifier-bit convention explained above) also language independent.
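
 For concreteness, a rough sketch of the nybble scheme quoted above. It
 works on a list of 4-bit values rather than packed bytes, and it covers
 only the two tiers the quote spells out; the later tiers are elided in the
 quote, so the sketch simply refuses rarer symbols. The names (encode,
 decode, FILLER, ESCAPE) are made up here, and reading "the next byte" as
 the next two nybbles is an assumption.

```python
# Sketch of the quoted 4-bit escape code. Tier 1: the 14 most common
# symbols get codes 0..13. Tier 2: ESCAPE plus one byte (two nybbles)
# covers the next 254 symbols. Anything rarer is out of scope here.

FILLER = 14  # the 15th pattern: padding, carries no symbol
ESCAPE = 15  # the 16th pattern: a second-tier index follows


def encode(symbols, ranked):
    """Encode symbols as a list of nybbles; `ranked` lists symbols by frequency."""
    rank = {s: i for i, s in enumerate(ranked)}
    out = []
    for s in symbols:
        r = rank[s]
        if r < 14:
            out.append(r)
        elif r < 14 + 254:
            b = r - 14
            out += [ESCAPE, b >> 4, b & 0xF]
        else:
            raise ValueError("symbol needs a tier not described in the quote")
    if len(out) % 2:
        out.append(FILLER)  # pad to a whole number of bytes
    return out


def decode(nybbles, ranked):
    """Invert encode(); must start at a symbol boundary."""
    out, i = [], 0
    while i < len(nybbles):
        n = nybbles[i]
        if n == FILLER:
            i += 1
        elif n == ESCAPE:
            out.append(ranked[14 + ((nybbles[i + 1] << 4) | nybbles[i + 2])])
            i += 3
        else:
            out.append(ranked[n])
            i += 1
    return out


ranked = list("etaoinshrdlucmfwypvbgkjqxz")  # illustrative ordering only
print(encode("the", ranked))  # prints [1, 7, 0, 14]
```

 Note that decode() has to begin at a known symbol boundary: a nybble taken
 from the middle of the stream may be a symbol, an escape, or half of an
 escaped byte.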

 I do believe that since we are discussing CHARACTER sets, we should
 leave out the codings based on dictionaries (word sets). They have their
 function, but are much more (application, language, etc.) dependent
 than the character sets. Let's reach some agreement on letters first.
 

	Yours  Dr. pom  -  a scientist  -   (quite mad)   pom@under.s1.gov

rob@pbhye.UUCP (Rob Bernardo) (08/19/87)

In article <15381@mordor.s1.gov> pom@s1-under.UUCP () writes:
+(  insiders info ) FACT on cedillas, umlauts, haceks, and other such ...
+ [modifiers] namely : In all languages I know, there are many kinds,
+ but ANY PARTICULAR LETTER either has one - or it does not. That means
+ that we need to reserve just 1 bit ( 0.. unmodified) and (1.. modified).
+  to take care of dozens of languages.

Not so.
	French allows ^ over all vowels, ` over "a", "e", and "u", and ' over "e".

	Hungarian allows two dots, ', and ''  over "u" and "o".

	Spanish allows ' and two dots over "u".


Just to name a few off the top of my head. :-)
-- 
I'm not a bug, I'm a feature.
Rob Bernardo, San Ramon, CA	(415) 823-2417	{pyramid|ihnp4|dual}!ptsfa!rob

sandi@apollo.uucp (Sandra Martin) (08/19/87)

Peter O. Mikes @ S-1 Project, LLNL writes:
> So e.g. if switch  ( ROM, printwheel,..) is set to German , modified o will
> put two dots ( umlaut) above o; In Czech the same bit will put ' above
>  'aeiou' but will put inverted ^ over consonants  ( since only 'aeiou'
>  are allowed to have  '  and only consonants can  have ^, and so it goes.
>
>>But this doesn't
>>address all problems I mentioned. How to construct a general character
>>with an arbitrary accent, umlaut or other diacritic mark? An 8-bit
>>enumeration isn't sufficient.
>
> The problem you  (somebody) mentioned is hereby addressed.
> To disprove my conjecture, name one language with Latin-based alphabet 
> and one letter in that alphabet, which admits more than one modifier.

French allows an 'e' to take an acute or grave accent, as well as
a circumflex and an umlaut. Examples:

    e'cole      (school)
    privile`ge  (privilege)
    e^tre       (to be)
    Noe"l       (Christmas)

The 'a', 'i', and 'u' in French also can take multiple diacriticals. And in
Swedish, the 'a' can take a ring or an umlaut. I imagine there are other
examples, but these are the ones I could think of off the top of my head.
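
The counterexamples above can be turned into a quick count of how many
modifier bits a letter would actually need. This is only a sketch: the
table holds nothing beyond the examples cited in this post, and the names
MARKS and bits_needed are invented for illustration.

```python
# Diacritics per (language, letter), taken only from the examples above.
MARKS = {
    ("French", "e"): {"acute", "grave", "circumflex", "umlaut"},
    ("Swedish", "a"): {"ring", "umlaut"},
}


def bits_needed(marks):
    """Bits required to tell apart the bare letter and each marked form.

    There are len(marks) + 1 states (one for the unmodified letter), and
    ceil(log2(len(marks) + 1)) equals len(marks).bit_length().
    """
    return len(marks).bit_length()


for (language, letter), marks in MARKS.items():
    print(language, letter, bits_needed(marks))
# prints:
# French e 3
# Swedish a 2
```

A single modifier bit distinguishes only two states per letter, so neither
of these letters fits the one-bit proposal.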

   Sandra Martin, Apollo Computer
   UUCP:  ...{mit-erl,mit-eddie,yale,uw-beaver,decvax!wanginst}!apollo!sandi
   ARPA:  apollo!sandi@eddie.mit.edu

joe@haddock.ISC.COM (Joe Chapman) (08/19/87)

>Besides I have VERY CONSTRUCTIVE
>(  insiders info ) FACT on cedillas, umlauts, haceks, and other such ...
> [modifiers] namely : In all languages I know, there are many kinds,
> but ANY PARTICULAR LETTER either has one - or it does not.
> To disprove my conjecture, name one language with Latin-based alphabet 
> and one letter in that alphabet, which admits more than one modifier.

I don't even have to get obscure: in French an ``e'' can take one of
three accents (grave, acute, circumflex) or a diaeresis.

> english sh  is perverse Hungarian's sz
> is actually one sound (soft s or s^).

Minor quibble: in Hungarian, sz is pronounced like the s in English
"soup", and s is pronounced as in "shop".

--
Joe Chapman
harvard!ima!joe

scottha@athena.TEK.COM (Scott Hankerson) (08/19/87)

In article <15381@mordor.s1.gov> pom@s1-under.UUCP () writes:
>>>But this doesn't
>>>address all problems I mentioned. How to construct a general character
>>>with an arbitrary accent, umlaut or other diacritic mark? An 8-bit
>>>enumeration isn't sufficient.
>
> The problem you  (somebody) mentioned is hereby addressed.
> To disprove my conjecture, name one language with Latin-based alphabet 
> and one letter in that alphabet, which admits more than one modifier.

In French, one can have up to four different modifiers over a particular
letter (the letter e can have a grave accent, one going the other direction
(I never can remember what they're called in English (')), a circumflex,
or an umlaut).  In addition, I may want to quote from other languages if
I write in French.

>   Oh, just BTW - using poor ASCII, which has no modifier bit, I am
>   using the convention that modifier is indicated by h ( e.g. 
>  a word:  (modified_s)ot would appear as shot. (which is quite wasteful
>  as whole h is needed to perform function of one bit).

Surely this would introduce even more ambiguities.  In German, an h 
lengthens the vowels.  Is a vowel followed by an h an umlauted vowel?
An umlauted vowel followed by an h? Or simply a vowel followed by an h?
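
The ambiguity can be made concrete with a small sketch that lists every
way to read a word when an h may be either a letter in its own right or a
modifier on the letter before it. The function name readings and the "*"
marker for a modified letter are invented for this illustration.

```python
def readings(word):
    """Yield every parse of `word` when 'h' may modify the preceding letter.

    A modified letter is shown with a trailing '*'.
    """
    if not word:
        yield []
        return
    head, rest = word[0], word[1:]
    # Reading 1: the head is a plain letter; parse the rest independently.
    for tail in readings(rest):
        yield [head] + tail
    # Reading 2: a following 'h' is the modifier mark for the head.
    if rest.startswith("h"):
        for tail in readings(rest[1:]):
            yield [head + "*"] + tail


for parse in readings("shot"):
    print(parse)
# prints:
# ['s', 'h', 'o', 't']
# ['s*', 'o', 't']
```

Every h in the text doubles the number of candidate readings unless some
rule outside the encoding settles which one is meant.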

I haven't seen anyone mention an ISO standard yet.  I was under the impression
that there was one.  Am I wrong?  I don't much care for the alternative
that I have seen used by terminal manufacturers in the US, which is a
keyboard with many of the special symbols replaced with accented characters.
That may be nice for writing documents, but it must be intolerable for
coding in C or any other programming language which uses many nonalphabetic
symbols.

alan@pdn.UUCP (Alan Lovejoy) (08/20/87)

In article <15381@mordor.s1.gov> pom@s1-under.UUCP () writes:
> To disprove my conjecture, name one language with Latin-based alphabet 
> and one letter in that alphabet, which admits more than one modifier.

French:  "e" can have either a left overstrike, a right overstrike or
a hat "^".  Admittedly, only the right overstrike changes the
pronunciation in MODERN Parisian French.  However, in Navajo vowels
must simultaneously be markable for nasality by a cedilla, as well as
by diacritical marks to indicate other variations.  

All vowels can be nasal/nonnasal and voiced/unvoiced, and I believe
there are exotic languages with even other variations (I'll have to
look that up, though).    

Things get even stickier when you consider the problem of multilingual
text, however (e.g., "I said to him, 'Who are you'?  To which he
answered, 'Je ne parle pas anglais.  Parlez-vous francais'?").

--Alan "Bozhe moi! Kommitjet Gosudarstvjenoj Bjezopastnostji
sljedujet...!!!!" Lovejoy

dean@hyper.UUCP (Dean Gahlon) (08/20/87)

in article <15381@mordor.s1.gov], pom@under..ARPA (Peter O. Mikes) says:
] 
]  The problem you  (somebody) mentioned is hereby addressed.
]  To disprove my conjecture, name one language with Latin-based alphabet 
]  and one letter in that alphabet, which admits more than one modifier.
] [random stuff deleted] 
]  
]  ( I am not quite sure if all mono-anglo-phones realise that english is
]  actually using pairs for sounds ( english sh  is perverse Hungarian's sz
]  is actually one sound (soft s or s^). 

	You mentioned it yourself - Hungarian. (O, for instance, has 
both long and short umlauts).

sommar@enea.UUCP (Erland Sommarskog) (08/21/87)

This was intended to go by mail, but it came back to me. (Athena was
an unknown host.)

In article <1583@athena.TEK.COM> you write:
>I haven't seen anyone mention an ISO standard yet.  I was under the impression
>that there was one.  Am I wrong?  

You must have missed it. Tim Lasko from DEC wrote an article on the
status of the eight ISO standards. That was in comp.std.internat some
weeks ago.

The ISO standards are not sufficient. I don't know whether you have read
my articles in comp.std.internat where I discussed the need for 
another concept for character representation. I find it quite
inconvenient to find the end of my alphabet somewhere at code 200.

>                                  I don't much care for the alternates
>that I have seen used by terminal manufacturers in the US which is a
>keyboard with many of the special symbols replaced with accented characters.
>That may be nice for writing documents, but it must be intolerable for
>coding in C or any other programming language which uses many nonalphabetic
>symbols.

Having screens with national characters replacing brackets and
braces is no problem. Many languages that use these characters
allow alternatives. (E.g. [] can be replaced by (..) in most Pascal
dialects.) Languages that do not provide alternatives, I simply 
refuse to use. 
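
To show what such a screen does to source text, here is a small sketch.
The mapping is an assumption based on the commonly described Swedish 7-bit
national variant (brackets and braces shown as national letters) and
should be checked against the standard; the names SWEDISH_7BIT and
as_displayed are made up for the example.

```python
# Assumed mapping: the six ASCII codes for [ \ ] { | } are displayed as
# national letters on a Swedish 7-bit terminal. Treat this table as an
# illustration, not as a quotation from the standard.
SWEDISH_7BIT = str.maketrans("[\\]{|}", "ÄÖÅäöå")


def as_displayed(source):
    """Return the text as such a terminal would show it."""
    return source.translate(SWEDISH_7BIT)


print(as_displayed("a[i] = b(.i.)"))  # prints aÄiÅ = b(.i.)
```

The bracketed index becomes unreadable, while the (. .) alternative passes
through untouched, which is why the alternative spellings matter.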
-- 

Erland Sommarskog       
ENEA Data, Stockholm    
sommar@enea.UUCP

henry@utzoo.UUCP (Henry Spencer) (08/23/87)

> I haven't seen anyone mention an ISO standard yet.  I was under the impression
> that there was one.  Am I wrong?  I don't much care for the alternates
> that I have seen used by terminal manufacturers in the US which is a
> keyboard with many of the special symbols replaced with accented characters.

Unfortunately, this *is* the (old) ISO standard.  Seven or eight of the
special symbols in ASCII are in positions which the ISO 7-bit standard
designates as "reserved for national alphabets", or words to that effect.
ASCII, of course, doesn't need any extra national-alphabet symbols, so
it filled those positions with neat but ASCII-specific things.

The mess that results from this was a major motivation for the new ISO
Latin standard, which is an 8-bit character set that includes all of ASCII
plus some extra goodies plus pretty well everything needed to write the
Latin-derived languages.  ISO Latin is unquestionably the wave of the
future -- it will help a lot and won't hurt much.  It WILL hurt a little
(for example, there aren't many Unix programs that use the top bit of char
for something else, but those few are exactly the programs that one least
wants to modify:  the editors and the shells!), but it won't be a tenth
as painful as the more drastic changes needed to do seriously non-Latin
alphabets like Chinese and Arabic.
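
A small illustration of why the top bit matters, assuming the 8-bit Latin
set assigns e-acute the code 0xE9 (as ISO 8859-1 does). The function name
strip_top_bit is made up; it stands in for any program that treats char as
7 bits of data plus a spare flag bit.

```python
def strip_top_bit(data: bytes) -> bytes:
    """Mimic a program that masks every char down to 7 bits."""
    return bytes(b & 0x7F for b in data)


raw = "école".encode("latin-1")  # first byte is 0xE9
print(strip_top_bit(raw))        # prints b'icole'
```

Plain ASCII passes through unchanged, so such a program looks correct
until the first accented letter arrives, and then it silently turns it
into a different letter.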

My own personal view is that ISO Latin is a Good Thing, I am planning my
software for it, and everybody else should too.  The various proposals
for dealing with the non-Latin alphabets, on the other hand, all seem to
me to have rather higher price tags, and I take a "wait and see" attitude
toward them.
-- 
Apollo was the doorway to the stars. |  Henry Spencer @ U of Toronto Zoology
Next time, we should open it.        | {allegra,ihnp4,decvax,utai}!utzoo!henry