[comp.std.internat] Character representation

sommar@enea.UUCP (Erland Sommarskog) (08/12/87)

Two things have inspired me to this article:
1) The reading of the (proposed) standards ISO/Latin 1-4.
2) The discussion "What is a byte".
Reading the standards you discover that there are a whole lot of
letters you never dreamt of, but still they have something in common.
(I'm only talking about Latin letters, but it applies to Greek and
Cyrillic as well.) With a few exceptions it is the same letters that
reappear; they are just modified in some way. They have accents,
cedillas, rings, dots, strokes etc. Thus, many are combinations of
two or more characters.
  
The standards are an attempt to satisfy the requirements of the
different languages by assigning each combination an integer
value. But isn't a character a more complicated data type
than just a simple enumeration type? In some languages the
combination constitutes a new letter ("a" with a ring or with dots,
"o" with dots in Swedish); in others you can apply accents and
other signs without affecting the sorting (e.g. French, Italian).
  I think that the simple representation of characters is entirely
due to the dominating position of the English language in the computer
world. If computers had been invented in France the problem would
have been solved. (And if they had been invented in Sweden, Englishmen
would have had to accept "v" and "w" being equivalent.)
  The conclusion is that a more sophisticated approach must be taken.
However, I must admit that I do not have any bright proposals right
now, but do think about it!
-- 

Erland Sommarskog       
ENEA Data, Stockholm    
sommar@enea.UUCP        

gordan@maccs.UUCP (Gordan Palameta) (08/13/87)

In article <2171@enea.UUCP> sommar@enea.UUCP(Erland Sommarskog) writes:
>
>value. But isn't a character a more complicated data type
>than just a simple enumeration type? In some languages the
>combination constitutes a new letter ("a" with a ring or with dots,
>"o" with dots in Swedish); in others you can apply accents and
>other signs without affecting the sorting (e.g. French, Italian).
>  I think that the simple representation of characters is entirely
>due to the dominating position of the English language in the computer
>world. If computers had been invented in France the problem would
>have been solved. (And if they had been invented in Sweden, Englishmen

It gets even more complicated:  in Spanish, I believe, ch is considered
a separate letter, between c and d in alphabetical order (likewise with ll).

It only goes to show that alphabetical order is language-dependent, and
identical strings will sort differently depending on locale.  The only
general solution is to have intelligent operating system routines to
handle sorting.

Despite 7-bit ASCII, which makes possible code such as
   if (c >= 'A' && c <= 'Z')
there is no reason why the numeric representation of a character should have
anything to do with the position of that character in a collating sequence.
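To make this concrete, here is a minimal sketch in C, assuming a library
that provides the locale routines setlocale() and strcoll() from the draft
ANSI standard; the collating order then comes from a table selected at run
time, not from the character codes themselves:

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Select the collating rules of the current locale; after this,
         * strcoll() compares by collating weight, not by code value.   */
        setlocale(LC_COLLATE, "");

        if (strcoll("chaleco", "cultura") < 0)
            printf("\"chaleco\" sorts before \"cultura\" here\n");
        else
            printf("\"cultura\" sorts before \"chaleco\" here\n");
        /* Under traditional Spanish rules "ch" follows "c", so the two
         * words can sort either way depending on the locale chosen.    */
        return 0;
    }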

-- 
UUCP:  ... !mnetor!lsuc!maccs!gordan              BITNET: GP@TANDEM
"Sumasshedshii vsekh stran, soyedinyaites'"        Gordan Palameta

henry@utzoo.UUCP (Henry Spencer) (08/14/87)

>   I think that the simple representation of characters is entirely
> due to the dominating position of the English language in the computer
> world. If computers had been invented in France the problem would
> have been solved...

Surely you jest!  If the French had invented computers, the official
rationale for FRSCII (or whatever it would be called :-)) would go to
great lengths to explain why cedillas were part of God's plan but no
civilized human being would ever put pairs of dots above letters!
The best summary I've ever heard was "those who speak English don't
really believe that other languages exist; those who speak French know
that other languages exist, but can't understand why".
-- 
Support sustained spaceflight: fight |  Henry Spencer @ U of Toronto Zoology
the soi-disant "Planetary Society"!  | {allegra,ihnp4,decvax,utai}!utzoo!henry

sommar@enea.UUCP (08/15/87)

In a recent article gordan@maccs.UUCP (Gordan Palameta) writes:
>In article <2171@enea.UUCP> sommar@enea.UUCP(Erland Sommarskog) writes:
>>  I think that the simple representation of characters is entirely
>>due to the dominating position of the English language in the computer
>>world. If computers had been invented in France the problem would
>>have been solved. (And if they had been invented in Sweden, Englishmen
>
>It gets even more complicated:  in Spanish, I believe, ch is considered
>a separate letter, between c and d in alphabetical order (likewise with ll).

Perfectly true. And Spanish is not unique. Polish, for instance, has cz,
dz and rz.

>It only goes to show that alphabetical order is language-dependent, and
>identical strings will sort differently depending on locale.  The only
>general solution is to have intelligent operating system routines to
>handle sorting.

It would be preferable to have the sorting as part of the language
in question. For example, in Ada:
   pragma LANGUAGE(French);
The support may be in the OS - or even in the hardware for speed - but
making it part of the language increases portability. But this doesn't
address all the problems I mentioned. How do you construct a general
character with an arbitrary accent, umlaut or other diacritic mark? An
8-bit enumeration isn't sufficient.
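Purely as a sketch of what such an abstract character type might look like
(the names and fields below are hypothetical, not taken from any standard),
one could keep the base letter separate from a set of diacritic flags:

    #include <stdio.h>

    /* Hypothetical abstract character: a base letter plus diacritic
     * flags, instead of one 8-bit enumeration value per combination. */
    enum diacritic {
        DIA_NONE    = 0,
        DIA_ACUTE   = 1 << 0,
        DIA_GRAVE   = 1 << 1,
        DIA_UMLAUT  = 1 << 2,
        DIA_RING    = 1 << 3,
        DIA_CEDILLA = 1 << 4
    };

    struct xchar {
        char base;             /* the unmodified Latin letter     */
        unsigned short marks;  /* OR-ed combination of diacritics */
    };

    int main(void)
    {
        struct xchar a_ring  = { 'a', DIA_RING };   /* Swedish a with ring  */
        struct xchar e_acute = { 'e', DIA_ACUTE };  /* French e with accent */

        printf("'%c' with mark bits %d\n", a_ring.base,  a_ring.marks);
        printf("'%c' with mark bits %d\n", e_acute.base, e_acute.marks);
        return 0;
    }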

>Despite 7-bit ASCII, which makes possible code such as
>   if (c >= 'A' && c <= 'Z')
>there is no reason why the numeric representation of a character should have
>anything to do with the position of that character in a collating sequence.

Right, but almost all programming today depends on it, doesn't it?
It's easier to implement and executes faster. The character type should
be an abstract one. The actual implementation (bit size and all) could
vary from compiler to compiler and from OS to OS.
  The simple numeric representation happens to work for English. For
French it doesn't. If computers had been invented in France....

-- 

Erland Sommarskog       
ENEA Data, Stockholm    
sommar@enea.UUCP        

root@hobbes.UUCP (08/15/87)

+---- Erland Sommarskog writes the following in article <2171@enea.UUCP> ----
| In some languages the combination constitutes a new letter
| ("a" with a ring or with dots, "o" with dots in Swedish); in others
| you can apply accents and other signs without affecting the sorting
| (e.g. French, Italian).
|   The conclusion is that a more sophisticated approach must be taken.
| However, I must admit that I do not have any bright proposals right
| now, but do think about it!
+----

I will be posting (as soon as I finish this note) a routine which we use
called stracmp() to the newsgroup comp.sources.misc.  It compares two
strings of 8-bit characters while taking into account the correct collating
sequence and precedence (if any) of accented letters.  The routine is
designed to drop in as a replacement for the common strcmp().

The code has been used on IBM-PCs which support a limited set of accented
characters in their character display ROMS and also with the ISO-Latin-1
alphabet (this requires more sophisticated display drivers).  It would be
very simple to add the tables for Latin-2 through n if desired.  The code
is not dependent on any particular hardware, but it does assume the
C compiler handles "unsigned chars".

I am including some of the comments I made in the header to the code:

Description:
	stracmp() implements a string compare which correctly handles
	accented (non-English) characters which have been encoded using
	8-bit characters.  It uses character lookup tables for doing 
	string compares when accented characters are present and/or a
	non-ASCII collating sequence is desired.
Theory:
	  The correct way of sorting (or comparing) strings which contain
	accented characters is to first compare the strings with all accents
	stripped. If the two strings are the same, then and only then are the
	accents used.  This second comparison involves only the accents.
	You can think of this as comparing the two strings with all the letters
	stripped.
	  Also, there are times when the "normal" ASCII collating sequence is
	not appropriate for lexical ordering (i.e. A <AE> B C <CEDILLA> D ...).
Examples:
	Comparing "Junta" and "Ju'nta:"  (the second word has diacritical
	marks over its two vowels, here written as ' and : after the letter
	they modify):
	    first we compare("Junta", "Junta"), which shows them EQUAL;
	    then we must compare("     ", " '  :").
	Thus, "Junta" comes before "Ju'nta:" in the lexical ordering of the
	two words.

	Comparing "Ju'nta" and "Ju'nto"  (both words have accented 'u's):
	    first we compare("Junta", "Junto"); since they are
	    different, we do not need to do anything more with the accents:
	"Ju'nta" is less than "Ju'nto".
 
-- 
John Plocher uwvax!geowhiz!uwspan!plocher  plocher%uwspan.UUCP@uwvax.CS.WISC.EDU

lambert@mcvax.UUCP (08/15/87)

In article <8410@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
) Surely you jest!  If the French had invented computers, the official
) rationale for FRSCII (or whatever it would be called :-)) would go to
) great lengths to explain why cedillas were part of God's plan but no
) civilized human being would ever put pairs of dots above letters!

But the French do!  Best known example: <<Noe"l>> (Xmas) with a dieresis on
the <<e>>.  Other examples: <<contigue">> (female form of <<contigu>> =
contiguous), <<mai"s>> (corn), <<cycloi"de>>.

This by itself does not prove anything about the relationship between the
French and civilized human beings one way or the other, but it makes it
implausible that they would argue as suggested.

) Support sustained spaceflight: fight |  Henry Spencer @ U of Toronto Zoology

There I was believing you made these hilarious signature lines on purpose,
but now I see the ambiguity was probably unintentional.  I'll pipe the
output of my fight through you.

-- 

Lambert Meertens, CWI, Amsterdam; lambert@cwi.nl

gordan@maccs.UUCP (Gordan Palameta) (08/17/87)

In article <47@piring.cwi.nl> lambert@cwi.nl (Lambert Meertens) writes:
>In article <8410@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>) [if the French had invented computers, cedillas would be considered part
>) of "God's plan", but not diereses]
>
>But the French do!  [examples follow]

If the French had invented computers, there is little doubt that a character
set to support French would have appeared sooner.  Such a set might have
supported German as well, and other Western European languages, but it is
hardly likely that it would have supported (in decreasing order of
probability) Scandinavian languages, Eastern European languages, Japanese,
Icelandic eth(?) and thorn(?), Cyrillic, Arabic, Hebrew, etc.

There is no chance that a 16-bit character set would have sprung up, fully
formed -- computer memory used to be very, very expensive (weren't characters
six bits, once upon a time?)

In short, it is difficult to see how the situation would have evolved several
decades after the introduction of FRenchSCII to a point much different from 
what we have today (the ISO 8-bit ASCII sets, JIS standards for Kanji, etc.)

>) Support sustained spaceflight: fight |  Henry Spencer @ U of Toronto Zoology
>
> [humorous comment]

Support two line signatures, the money you save may be your own.

-- 
UUCP:  ... !mnetor!lsuc!maccs!gordan              BITNET: GP@TANDEM
"Sumasshedshii vsekh stran, soyedinyaites'"        Gordan Palameta

gordan@maccs.UUCP (Gordan Palameta) (08/17/87)

In article <2183@enea.UUCP> sommar@enea.UUCP(Erland Sommarskog) writes:
>In a recent article gordan@maccs.UUCP (Gordan Palameta) writes:
>>In article <2171@enea.UUCP> sommar@enea.UUCP(Erland Sommarskog) writes:
>
>But this doesn't
>address all problems I mentioned. How to construct a general character
>with an arbitrary accent, umlaut or other diacritic mark? An 8-bit
>enumarate isn't sufficient.

t umlaut or q cedilla would probably be used very rarely, nor is it likely
that anyone would go to the trouble of designing a font to accommodate such
characters.  Another cost of such generality would be that accents and other
marks would probably have to be indicated by escape sequences in conjunction
with the unmodified letter.  This would make string-processing software more
complicated (and slower), and text would be longer.

>>Despite 7-bit ASCII, which makes possible code such as
>>   if (c >= 'A' && c <= 'Z')
>>there is no reason why the numeric representation of a character should have
>>anything to do with the position of that character in a collating sequence.
>
>Right, but almost all programming today depends on it, isn't it so? 
>It's easier to implement and executes faster.

Not at all, just define a 256-byte lookup table in an include file, and modify
the code to
     if (coll[c] >= FIRST_CHAR && coll[c] <= LAST_CHAR)
with very little loss of efficiency.  To accommodate perverse languages like
Spanish and Polish which insist on two-letter combinations for sorting,
this won't do, however: change the square brackets to round ones
(with some loss of efficiency, but very little inconvenience in coding)
(well, some inconvenience: c can't be a single character any more).
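
A sketch of what such a coll() function might look like (the weights and the
digraph handling below are invented for the example; a real table would of
course cover the whole character set):

    #include <stdio.h>

    /* Hypothetical coll(): reads through a pointer so that the Spanish
     * digraph "ch" can get its own weight between "c" and "d"; *advance
     * reports how many input characters were consumed (1 or 2). */
    static int coll(const char *p, int *advance)
    {
        *advance = 1;
        if ((p[0] == 'c' || p[0] == 'C') && (p[1] == 'h' || p[1] == 'H')) {
            *advance = 2;
            return ('c' - 'a') * 2 + 1;     /* just after plain c       */
        }
        if (p[0] >= 'a' && p[0] <= 'z')
            return (p[0] - 'a') * 2;        /* even slots for a..z      */
        if (p[0] >= 'A' && p[0] <= 'Z')
            return (p[0] - 'A') * 2;
        return 256 + (unsigned char)p[0];   /* non-letters sort last    */
    }

    int main(void)
    {
        int n;
        printf("c=%d  ch=%d  d=%d\n",
               coll("c", &n), coll("ch", &n), coll("d", &n));
        return 0;
    }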

Never mind the French; what if things had turned out differently in 1588
with the Armada, and the Spanish had invented computers?  Or the Chinese?

Followups to alt.universes.



-- 
UUCP:  ... !mnetor!lsuc!maccs!gordan              BITNET: GP@TANDEM
"Sumasshedshii vsekh stran, soyedinyaites'"        Gordan Palameta

frisk@askja.UUCP (08/18/87)

In article <176@hobbes.UUCP> root@hobbes.UUCP (John Plocher) writes:
>I will be posting (as soon as I finish this note) a routine which we use
>called stracmp() to the newsgroup comp.sources.misc.  It compares two
>strings of 8 bit characters while taking into account the correct collating
>sequence and precedence (if any) of accented letters.  The routine is
>designed to drop in as a replacement for the common strcmp().

Now - the problem with this is that there is no "correct" collating
sequence for all languages. One example is the position of character 197
in Latin-1 (A with a circle above). In some languages it is the first
letter in the alphabet, in others one of the last.

The method described in this article would work partially with Icelandic,
but not quite. To see why, consider the Icelandic alphabet.

A  'A  B  D  (ETH)  E  'E  F  G  H  I  'I  J  K  L  M  N  O  'O  P  R  S  T
U  'U  V  X  Y  'Y (THORN) (AE) (o with two dots above)
  
Of these 32 characters, the last two would end up in the wrong places.

What is needed is either:

    a strcmp(string1,string2,LANGUAGE) function

or

    a strcmp_LANGUAGE(string1,string2)
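
A sketch of the first form, purely as an illustration (the name strlangcmp
and the dispatch table are hypothetical; the per-language compares are
stubbed out with plain strcmp()):

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical: the language argument selects a per-language
     * collating routine; here every entry is stubbed with strcmp(). */
    enum language { LANG_ICELANDIC, LANG_SWEDISH, LANG_ENGLISH };

    typedef int (*cmpfn)(const char *, const char *);

    static int cmp_ascii(const char *a, const char *b)
    {
        return strcmp(a, b);
    }

    static cmpfn cmp_for[] = {
        cmp_ascii,   /* LANG_ICELANDIC: would use the order shown above */
        cmp_ascii,   /* LANG_SWEDISH:   would put aa, ae, oe after z    */
        cmp_ascii    /* LANG_ENGLISH                                    */
    };

    int strlangcmp(const char *a, const char *b, enum language lang)
    {
        return cmp_for[lang](a, b);
    }

    int main(void)
    {
        printf("%d\n", strlangcmp("abc", "abd", LANG_ICELANDIC));
        return 0;
    }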
 

-- 
Fridrik Skulason  Univ. of Iceland, Computing Center
       UUCP  ...mcvax!hafro!askja!frisk                BIX  frisk

                     "This line intentionally left blank"

dan@sics.UUCP (08/18/87)

In article <176@hobbes.UUCP> root@hobbes.UUCP (John Plocher) writes:
>Theory:
>	  The correct way of sorting (or comparing) strings which contain
>	accented characters is to first compare the strings with all accents
>	stripped. If the two strings are the same, then and only then are the
>	accents used.  This second comparison involves only the accents.
>	You can think of this as comparing the two strings with all the letters
>	stripped.

Sorry, that is not the correct way to sort in Swedish.
The three letters a(with a ring), a(with two dots), o(with two dots)
always come at the end of the alphabet, after z.
Traditionally v and w are also considered the same letter in Swedish.

There is unfortunately no universal way to sort things alphabetically;
each language has its own ways. This fact has for instance been incorporated
into the Macintosh system, where the sorting depends on the national
version.

	Dan Sahlin              (dan@sics.uucp)

pom@under..ARPA (Peter O. Mikes) (08/19/87)

To: gordan@maccs.UUCP
Subject: Re: Character representation
Newsgroups: comp.std.internat, sci.lang
In-Reply-To: <719@maccs.UUCP>

In article <719@maccs.UUCP> you write:
                            >Followups to alt.universes.
  I am sorry, but according to the latest QM, the multiple universes not
 only keep splitting, they also merge. This happens to be one such
 feedback from an Alternative Universe. Besides, I have a VERY CONSTRUCTIVE
 (insider's info) FACT on cedillas, umlauts, haceks, and other such
 [modifiers], namely: in all languages I know there are many kinds,
 but ANY PARTICULAR LETTER either has one - or it does not. That means
 that we need to reserve just 1 bit (0 = unmodified, 1 = modified)
 to take care of dozens of languages.
 So e.g. if a switch (ROM, printwheel, ...) is set to German, a modified o
 will put two dots (umlaut) above the o; in Czech the same bit will put '
 above 'aeiou' but will put an inverted ^ over consonants (since only
 'aeiou' are allowed to have ' and only consonants can have ^), and so it
 goes.
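
 A sketch of that conjecture in C, just to make it concrete (MODIFIER_BIT
 and the language switch are made up for this illustration; the follow-ups
 below give counterexamples to the one-modifier claim):

    #include <stdio.h>
    #include <string.h>

    #define MODIFIER_BIT 0x80     /* hypothetical: top bit = "modified" */

    enum lang { GERMAN, CZECH };

    /* The same bit is read as "umlaut" when the device is switched to
     * German, but as "acute on a vowel / hacek on a consonant" in Czech. */
    static const char *describe(unsigned char c, enum lang l)
    {
        unsigned char letter = c & (unsigned char)~MODIFIER_BIT;

        if (!(c & MODIFIER_BIT))
            return "plain letter";
        if (l == GERMAN)
            return "letter with umlaut";
        return strchr("aeiou", letter) ? "letter with acute accent"
                                       : "letter with hacek";
    }

    int main(void)
    {
        printf("%s\n", describe('o' | MODIFIER_BIT, GERMAN)); /* o-umlaut */
        printf("%s\n", describe('o' | MODIFIER_BIT, CZECH));  /* o-acute  */
        return 0;
    }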

>>But this doesn't
>>address all the problems I mentioned. How do you construct a general
>>character with an arbitrary accent, umlaut or other diacritic mark? An
>>8-bit enumeration isn't sufficient.

 The problem you (somebody) mentioned is hereby addressed.
 To disprove my conjecture, name one language with a Latin-based alphabet
 and one letter in that alphabet which admits more than one modifier.

   Oh, just BTW - using poor ASCII, which has no modifier bit, I am
   using the convention that the modifier is indicated by h (e.g.
   the word (modified_s)ot would appear as shot), which is quite wasteful,
   as a whole h is needed to perform the function of one bit.

 (I am not quite sure if all mono-anglo-phones realise that English
 actually uses pairs for sounds: English sh, like the perverse Hungarian
 sz, is actually one sound (soft s, or s^). The difference is mostly that
 English is ambiguous and arbitrary, and (on the positive side) allows
 collating based on single letters - but anybody can accept that, since
 you get your pairs like sz sorted in (almost) the same sequence anyway.)

 In a related posting, David Phillip Oster (oster@dewey.soe.berkeley.edu)
 writes:
> There is no reason why we couldn't use a huffman encoding
>scheme: the 14 most common ideograms fit in a 4 bit nybble, the 15th
>pattern is a filler, and the 16th pattern means that the next byte
>encodes the 254 next most common ideograms, the 255 bit pattern
    ...............................
>this idea would also work for English. Assuming that the average
>English word takes 6*8 bits (average length of 5 + terminating space
>* 8 bit ascii) you could cut the disk space required for computer..
   and I SAY, there is a reason:
  I would like to propose a criterion for (or attribute of) the coding of
 text. A coding is LOCAL (within n) if from each 3n bytes I can derive
 one (middle) letter of the encoded text. In this sense, the codings
 based on pairs (Polish, Spanish, sh for s^, etc.) are all local (within 2),
 but coding based on the frequency of words is not (besides being language
 dependent). (Please recall that I consider ideographs to be 'words' made
 of strokes.)
 The coding based on the frequency of characters is local and (if we accept
 the above-explained modifier-bit convention) also language independent.

 I do believe that since we are discussing CHARACTER sets, we should
 leave out the codings based on dictionaries (word sets) - they have their
 function - but they are much more (application, language, etc.) dependent
 than the character sets. Let's reach some agreement on letters first.
 

	Yours  Dr. pom  -  a scientist  -   (quite mad)   pom@under.s1.gov 

alin@sunybcs.uucp (Alin Sangeap) (08/19/87)

In article <15381@mordor.s1.gov> pom@s1-under.UUCP () writes:
|Besides, I have a VERY CONSTRUCTIVE
|(  insider's info ) FACT on cedillas, umlauts, haceks, and other such ...
| [modifiers] namely : In all languages I know, there are many kinds,
| but ANY PARTICULAR LETTER either has one - or it does not. That means
| that we need to reserve just 1 bit ( 0.. unmodified) and (1.. modified)
|  to take care of dozens of languages.

| To disprove my conjecture, name one language with Latin-based alphabet 
| and one letter in that alphabet which admits more than one modifier.
|
Romanian (of the Latin group of languages):
Letter a: a (plain)
          a with a parenthesis facing upwards above it (a breve)
          a with a French-style circumflex accent (the upper half of a
                                                   45-degree-tilted square)
--
           Alin Sangeap                    State U. of N.Y. @ Buffalo C.S.
INTERNET:  alin@cs.buffalo.edu             BITNET:    alin@sunybcs.bitnet
UUCP:     {allegra,ames,boulder,decvax,rocksanne,rutgers,watmath}!sunybcs!alin
NSA:       please decode all secret cryptography ciphers; best of wishes, A.

wales@ucla-cs.UUCP (08/19/87)

In article <15381@mordor.s1.gov> pom@s1-under.UUCP () writes:
>Besides I have VERY CONSTRUCTIVE (insiders info) FACT on cedillas,
>umlauts, haceks, and other such ... [modifiers] namely : In all lan-
>guages I know, there are many kinds, but ANY PARTICULAR LETTER either
>has one - or it does not. That means that we need to reserve just 1
>bit (0.. unmodified) and (1.. modified).  to take care of dozens of
>languages.

>To disprove my conjecture, name one language with Latin-based alphabet 
>and one letter in that alphabet, which admits more than one modifier.

Good try, really, but there are several counterexamples:

Czech.  "U" can have an acute accent, or a small circle.  Also, "E"
    can have either an acute accent or a "hacek" (V-like accent).

French.  "E" can have an acute, grave, or circumflex accent, or a
    diaeresis (two dots).  "A" and "U" can have either a grave or
    a circumflex accent.  "I" can have either a circumflex accent
    or a diaeresis.

Hungarian.  "O" and "U" can have a regular acute accent, a regular
    umlaut (two dots), or a "long" umlaut (two acute accents).

Polish.  "Z" can have an acute accent, or a single dot.

Romanian.  "A" can have a breve ("short" sign, like a small U) or a
    circumflex.

Swedish.  "A" can have either an umlaut, or a small circle.

Vietnamese.  There are several different kinds of accent marks used in
    this language to indicate tones (syllable pitch patterns), and as
    far as I'm aware, any of these accents may occur on any vowel.
    (And, yes, modern Vietnamese *does* use the Latin alphabet.)

It may or may not be relevant, for purposes of this discussion, to note
that some of the above languages treat the "modified" versions of their
letters as completely distinct letters in their own right.

-- Rich Wales // UCLA Computer Science Department // +1 213-825-5683
	3531 Boelter Hall // Los Angeles, California 90024-1596 // USA
	wales@CS.UCLA.EDU   ...!(ucbvax,rutgers)!ucla-cs!wales
"Sir, there is a multilegged creature crawling on your shoulder."

rob@pbhye.UUCP (Rob Bernardo) (08/19/87)

In article <15381@mordor.s1.gov> pom@s1-under.UUCP () writes:
+(  insider's info ) FACT on cedillas, umlauts, haceks, and other such ...
+ [modifiers] namely : In all languages I know, there are many kinds,
+ but ANY PARTICULAR LETTER either has one - or it does not. That means
+ that we need to reserve just 1 bit ( 0.. unmodified) and (1.. modified).
+  to take care of dozens of languages.

Not so.
	French allows ' and ^ over all vowels and ` additionally over "e".

	Hungarian allows two dots, ', and ''  over "u" and "o".

	Spanish allows ' and two dots over "u".


Just to name a few off the top of my head. :-)
-- 
I'm not a bug, I'm a feature.
Rob Bernardo, San Ramon, CA	(415) 823-2417	{pyramid|ihnp4|dual}!ptsfa!rob

sandi@apollo.uucp (Sandra Martin) (08/19/87)

Peter O. Mikes @ S-1 Project, LLNL writes:
> So e.g. if switch  ( ROM, printwheel,..) is set to German , modified o will
> put two dots ( umlaut) above o; In Czech the same bit will put ' above
>  'aeiou' but will put inverted ^ over consonants  ( since only 'aeiou'
>  are allowed to have  '  and only consonants can  have ^, and so it goes.
>
>>But this doesn't
>>address all the problems I mentioned. How do you construct a general
>>character with an arbitrary accent, umlaut or other diacritic mark? An
>>8-bit enumeration isn't sufficient.
>
> The problem you  (somebody) mentioned is hereby addressed.
> To disprove my conjecture, name one language with Latin-based alphabet 
> and one letter in that alphabet which admits more than one modifier.

French allows an 'e' to take an acute or grave accent, as well as
a circumflex and an umlaut. Examples:

    e'cole      (school)
    privile`ge  (privilege)
    e^tre       (to be)
    Noe"l       (Christmas)

The 'a', 'i', and 'u' in French also can take multiple diacriticals. And in
Swedish, the 'a' can take a ring or an umlaut. I imagine there are other
examples, but these are the ones I could think of off the top of my head.

   Sandra Martin, Apollo Computer
   UUCP:  ...{mit-erl,mit-eddie,yale,uw-beaver,decvax!wanginst}!apollo!sandi
   ARPA:  apollo!sandi@eddie.mit.edu

joe@haddock.ISC.COM (Joe Chapman) (08/19/87)

>Besides, I have a VERY CONSTRUCTIVE
>(  insider's info ) FACT on cedillas, umlauts, haceks, and other such ...
> [modifiers] namely : In all languages I know, there are many kinds,
> but ANY PARTICULAR LETTER either has one - or it does not.
> To disprove my conjecture, name one language with Latin-based alphabet 
> and one letter in that alphabet which admits more than one modifier.

I don't even have to get obscure: in French an ``e'' can take one of
three accents (grave, acute, circumflex) or a diaeresis.

> english sh  is perverse Hungarian's sz
> is actualy one sound (soft s or s^).

Minor quibble: in Hungarian, sz is pronounced like the s in English
"soup", and s is pronounced as in "shop".

--
Joe Chapman
harvard!ima!joe

scottha@athena.TEK.COM (Scott Hankerson) (08/19/87)

In article <15381@mordor.s1.gov> pom@s1-under.UUCP () writes:
>>>But this doesn't
>>>address all the problems I mentioned. How do you construct a general
>>>character with an arbitrary accent, umlaut or other diacritic mark? An
>>>8-bit enumeration isn't sufficient.
>
> The problem you (somebody) mentioned is hereby addressed.
> To disprove my conjecture, name one language with Latin-based alphabet 
> and one letter in that alphabet which admits more than one modifier.

In French, one can have up to four different modifiers over a particular
letter: the letter e can have a grave accent, one going the other direction
(I never can remember what they're called in English: '), a circumflex,
or an umlaut.  In addition, I may want to quote from other languages if
I write in French.
 
>   Oh, just BTW - using poor ASCII, which has no modifier bit, I am
>   using the convention that the modifier is indicated by h (e.g.
>   the word (modified_s)ot would appear as shot), which is quite wasteful,
>   as a whole h is needed to perform the function of one bit.

Surely this would introduce even more ambiguities.  In German, an h
lengthens the vowels.  Is a vowel followed by an h an umlauted vowel?
An umlauted vowel followed by an h? Or simply a vowel followed by an h?

I haven't seen anyone mention an ISO standard yet.  I was under the impression
that there was one.  Am I wrong?  I don't much care for the alternates
that I have seen used by terminal manufacturers in the US, which amount to a
keyboard with many of the special symbols replaced by accented characters.
That may be nice for writing documents, but it must be intolerable for
coding in C or any other programming language which uses many nonalphabetic
symbols.

sommar@enea.UUCP (08/19/87)

In a recent article gordan@maccs.UUCP (Gordan Palameta) writes:
>t umlaut or q cedilla would probably be used very rarely, nor is it likely
>that anyone would go to the trouble of designing a font to accommodate such
>characters.  Another cost of such generality would be that accents and other
>marks would probably have to be indicated by escape sequences in conjunction
>with the unmodified letter.  This would make string-processing software more
>complicated (and slower), and text would be longer.

If you had something like the 8th bit meaning that the following byte is a
modifier, this would only moderately increase the length of the text and the
string-processing time. This solution does not, however, by itself address
the problem that different languages have different collating sequences.
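
A sketch of decoding such a stream in C (the top-bit convention and the
accent codes here are made up for the illustration, not taken from any
standard):

    #include <stdio.h>

    #define ACCENT_FOLLOWS 0x80   /* hypothetical: top bit set on a letter
                                     means the next byte names its accent */

    /* Print each letter together with its accent code, if any.
     * Assumes the stream is well formed (a flagged letter is always
     * followed by a modifier byte). */
    static void decode(const unsigned char *s)
    {
        while (*s) {
            if (*s & ACCENT_FOLLOWS) {
                printf("%c(accent %d) ", *s & 0x7F, (int)s[1]);
                s += 2;           /* letter byte plus one modifier byte */
            } else {
                printf("%c ", *s);
                s += 1;
            }
        }
        printf("\n");
    }

    int main(void)
    {
        /* "Noel" where the e carries the modifier bit, followed by the
         * (made-up) accent code 2 for a dieresis. */
        unsigned char noel[] = { 'N', 'o', 'e' | ACCENT_FOLLOWS, 2, 'l', 0 };
        decode(noel);
        return 0;
    }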

>Not at all, just define a 256-byte lookup table in an include file, and modify
>the code to
>     if (coll[c] >= FIRST_CHAR && coll[c] <= LAST_CHAR)
>with very little loss of efficiency.  To accommodate perverse languages like
>Spanish and Polish which insist on two-letter combinations for sorting,

Spanish and Polish aren't more perverse than English.
  Of course I know about look-up tables. I have myself written a programme
that uses a two-level look-up table for comparing words. (And the words
are transcribed in three levels. You don't want the hyphen in a hyphenated
word to be significant.) But to have that in every single programme that
does string comparisons? No, thank you. It increases the complexity of the
code and decreases its readability.
  It would be much nicer if "ch1 >= ch2" meant that ch1 comes after
or at the same position as ch2 in the alphabet we have currently chosen.
(It's unclear what equality means when modified letters are involved.
Probably you will need two kinds of equality.)

>Never mind the French; what if things had turned out differently in 1588
>with the Armada, and the Spanish had invented computers?  Or the Chinese?

I just took the Frenchmen as an example, OK? No matter who had invented
the computers; if their language also had had the dominating position that
English have, that language would have set the standard for character
representation with no other language in mind. I took French as example
since they have plenty of for sorting non-significant modifiers. 

>Followups to alt.universes.

If you find the subject that uninteresting, why did you ever write
the article at all?
-- 

Erland Sommarskog       
ENEA Data, Stockholm    
sommar@enea.UUCP        

alan@pdn.UUCP (Alan Lovejoy) (08/20/87)

In article <15381@mordor.s1.gov> pom@s1-under.UUCP () writes:
> To disprove my conjecture, name one language with Latin-based alphabet 
> and one letter in that alphabet which admits more than one modifier.

French:  "e" can have either a left overstrike, a right overstrike or
a hat "^".  Admittedly, only the right overstrike changes the
pronunciation in MODERN Parisian French.  However, in Navajo vowels
must simultaneously be markable for nasality by a cedilla, as well as
by diacritical marks to indicate other variations.  

All vowels can be nasal/nonnasal and voiced/unvoiced, and I believe
there are exotic languages with even other variations (I'll have to
look that up, though).    

Things get even stickier when you consider the problem of multilingual
text, however (e.g., "I said to him, 'Who are you?'  To which he
answered, 'Je ne parle pas anglais.  Parlez-vous francais?'").

--Alan "Bozhe moi! Kommitjet Gosudarstvjenoj Bjezopastnostji
sljedujet...!!!!" Lovejoy

dean@hyper.UUCP (Dean Gahlon) (08/20/87)

In article <15381@mordor.s1.gov>, pom@under..ARPA (Peter O. Mikes) says:
] 
]  The problem you (somebody) mentioned is hereby addressed.
]  To disprove my conjecture, name one language with Latin-based alphabet 
]  and one letter in that alphabet which admits more than one modifier.
] [random stuff deleted] 
]  
]  ( I am not quite sure if all mono-anglo-phones realise that English
]  actually uses pairs for sounds ( English sh, like the perverse Hungarian
]  sz, is actually one sound (soft s or s^). 

	You mentioned it yourself - Hungarian. (O, for instance, has 
both long and short umlauts).

sommar@enea.UUCP (Erland Sommarskog) (08/21/87)

This was intended to go by mail, but it came back to me. (Athena was
an unknown host.)

In article <1583@athena.TEK.COM> you write:
>I haven't seen anyone mention an ISO standard yet.  I was under the impression
>that there was one.  Am I wrong?  

You must have missed it. Tim Lasko from DEC wrote an article on the
status of the eight ISO standards. That was in comp.std.internat some
weeks ago.

The ISO standards are not sufficient. I don't know whether you have read
my articles in comp.std.internat where I discussed the need for
another concept for character representation. I find it quite
inconvenient to find the end of my alphabet somewhere at code 200.

>                                  I don't much care for the alternates
>that I have seen used by terminal manufacturers in the US, which amount to a
>keyboard with many of the special symbols replaced by accented characters.
>That may be nice for writing documents, but it must be intolerable for
>coding in C or any other programming language which uses many nonalphabetic
>symbols.

Having screens with national characters replacing brackets and
braces is no problem. Many languages that use these characters
allow alternatives. (E.g. [] can be replaced by (..) in most Pascal
dialects.) Languages that do not provide alternatives I simply
refuse to use.
-- 

Erland Sommarskog       
ENEA Data, Stockholm    
sommar@enea.UUCP        

andersa@kuling.UUCP (Anders Andersson) (08/22/87)

In article <719@maccs.UUCP> gordan@maccs.UUCP (Gordan Palameta) writes:
>t umlaut or q cedilla would probably be used very rarely, nor is it likely
>that anyone would go to the trouble of designing a font to accommodate such
>characters.  Another cost of such generality would be that accents and other
>marks would probably have to be indicated by escape sequences in conjunction
>with the unmodified letter.  This would make string-processing software more
>complicated (and slower), and text would be longer.

The existence of something like four different 94-character "Latin" sets
suggests that 8 bits wouldn't suffice anyway, although I haven't counted the
exact number of existing glyph combinations. I believe Welsh includes some
strange accented consonants, but I don't remember which (maybe w^). If you
also take Vietnamese into account, which allows several accents used at the
same time, you'd definitely overflow the table. TeX provides for arbitrary
combinations of accents, and I think this approach is quite simple (although
I don't suggest TeX as the encoding scheme to be used for files in general).

I don't think somebody has to design a font manually for each and every
combination, as the acute accent over e looks pretty much the same as the
acute accent over o, and the combination could be done automatically at
display-time. Some characters will need special treatment though, like
capital Swedish A with circle above (they should usually touch each other)
and Polish bar-crossed L. The amount of programming and CPU power to be
used for this depends on what quality and resolution of display you require.

If this general approach turns out to be the most practical one technically,
some people may of course go hog wild putting circles under X and cedillas
over 7, but there is as little point in stopping them as in preventing
people from writing "fiYw#s" with a proportional font. Just apply the
general accent attachment rule and they'll be quiet...

>     if (coll[c] >= FIRST_CHAR && coll[c] <= LAST_CHAR)
>with very little loss of efficiency.  To accommodate perverse languages like
>Spanish and Polish which insist on two-letter combinations for sorting,

What about the thing "Mac" or "Mc" in English (Scottish?) proper names?
I agree this example is a little extreme in comparison to the Spanish
"graphemes" ch, ll and rr (?), as well as czech ch. Maybe the English
don't mind seeing "McDonald" sorted after "Machiavelli", or whatever the
rule is/was - has it been abolished by now?

There are different kinds of sorting even within one language, depending on
the context. Donald E. Knuth provides a wonderful collection of rules for
bibliographic use in the beginning of the "Sorting and Searching" volume of
"The Art of Computer Programming", such as ignoring articles and spelling
out numbers.
These rules don't apply to filenames in a UNIX directory, I think!
-- 
Anders Andersson, Dept. of Computer Systems, Uppsala University, Sweden
Phone: +46 18 183170
UUCP: andersa@kuling.UUCP (...!{seismo,mcvax}!enea!kuling!andersa)

andersa@kuling.UUCP (Anders Andersson) (08/23/87)

In article <2201@enea.UUCP> sommar@enea.UUCP(Erland Sommarskog) writes:
>>     if (coll[c] >= FIRST_CHAR && coll[c] <= LAST_CHAR)

>  Of course I know about look-up tables. I have myself written a programme
>that uses a two-level look-up table for comparing words. (And the words
>are transcribed in three levels. You don't want the hyphen in a hyphenated
>word to be significant.) But to have that in every single programme that
>does string comparisons? No, thank you. It increases the complexity of the
>code and decreases its readability.

Nobody coding an application should of course ever have to implement those
string comparison routines explicitly over and over again, but should rather
refer to generic library routines, like isalpha(c). Several libraries can
be provided for different kinds of sorting (if you want "RFC666.TXT" after
"RFC-INDEX.TXT" but "RFC888.TXT" before, because of the way numbers are
spelled out, that's your choice). If the string comparison library
is too big to be linked into your 42 executables, then provide it as a
sharable image, or (in an emergency) put it in your favourite kernel...
-- 
Anders Andersson, Dept. of Computer Systems, Uppsala University, Sweden
Phone: +46 18 183170
UUCP: andersa@kuling.UUCP (...!{seismo,mcvax}!enea!kuling!andersa)

henry@utzoo.UUCP (Henry Spencer) (08/23/87)

> I haven't seen anyone mention an ISO standard yet.  I was under the impression
> that there was one.  Am I wrong?  I don't much care for the alternates
> that I have seen used by terminal manufacturers in the US, which amount to a
> keyboard with many of the special symbols replaced by accented characters.

Unfortunately, this *is* the (old) ISO standard.  Seven or eight of the
special symbols in ASCII are in positions which the ISO 7-bit standard
designates as "reserved for national alphabets", or words to that effect.
ASCII, of course, doesn't need any extra national-alphabet symbols, so
it filled those positions with neat but ASCII-specific things.

The mess that results from this was a major motivation for the new ISO
Latin standard, which is an 8-bit character set that includes all of ASCII
plus some extra goodies plus pretty well everything needed to write the
Latin-derived languages.  ISO Latin is unquestionably the wave of the
future -- it will help a lot and won't hurt much.  It WILL hurt a little
(for example, there aren't many Unix programs that use the top bit of char
for something else, but those few are exactly the programs that one least
wants to modify:  the editors and the shells!), but it won't be a tenth
as painful as the more drastic changes needed to do seriously non-Latin
alphabets like Chinese and Arabic.

My own personal view is that ISO Latin is a Good Thing, I am planning my
software for it, and everybody else should too.  The various proposals
for dealing with the non-Latin alphabets, on the other hand, all seem to
me to have rather higher price tags, and I take a "wait and see" attitude
toward them.
-- 
Apollo was the doorway to the stars. |  Henry Spencer @ U of Toronto Zoology
Next time, we should open it.        | {allegra,ihnp4,decvax,utai}!utzoo!henry

gordan@maccs.UUCP (Gordan Palameta) (08/25/87)

In article <8462@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>
>My own personal view is that ISO Latin is a Good Thing, I am planning my
>software for it, and everybody else should too.

Yay.  Hear that, programmers?  When you write the Next Great Program
(that slices, dices, and upholsters furniture), DON'T TOUCH the eighth bit!
This means YOU!

>                                                 The various proposals
>for dealing with the non-Latin alphabets, on the other hand, all seem to
>me to have rather higher price tags, and I take a "wait and see" attitude
>toward them.

Ummm, Arabic, Hebrew, Greek, and Cyrillic are or will shortly be taken care of
by the same standardization process that produced ISO Latin-1.  Each uses a
different upper half of the character set.  I think there are even standard
escape sequences suggested for switching between the different ISO character
sets, for terminals capable of displaying more than one such set.


On a different note, the first 32 positions of the upper half of the character
set are supposed to be reserved for a bunch of new non-printing characters.
On page 26 of August 1985 BYTE, a preliminary version of ISO Latin 1 is listed,
and some of these "control" characters have names, e.g. 08/04 = IND, 08/05 =
NEL, etc.  It is implied in the accompanying letter that these are intended
for word-processing commands.

Have the uses of these been standardized?  If so, it would certainly seem
worth publicizing and discussing here.

henry@utzoo.UUCP (Henry Spencer) (08/25/87)

> Ummm, Arabic, Hebrew, Greek, and Cyrillic are or will shortly be taken care of
> by the same standardization process that produced ISO Latin-1.  Each uses a
> different upper half of the character set...

Unfortunately, this brings us back to the old problem that the meaning of a
byte is context-dependent.  There were alternate character sets for the Latin
languages before, and standard escape sequences for switching; much good it
did us.  Anything with mode-switching is an order of magnitude harder to
handle intelligently than a modeless code like ASCII or ISO Latin.  Don't
forget the right-to-left problems in Arabic and Hebrew, for that matter.
I don't know what the best answer is, and am not convinced that anyone else
does either.  Hence "wait and see".  My sympathy goes out to the people who
have compelling commercial reasons to do something about these issues now;
it can't be much fun.
-- 
"There's a lot more to do in space   |  Henry Spencer @ U of Toronto Zoology
than sending people to Mars." --Bova | {allegra,ihnp4,decvax,utai}!utzoo!henry

daveb@geac.UUCP (Brown) (08/25/87)

In article <737@maccs.UUCP> gordan@maccs.UUCP (Gordan Palameta) writes:
>On a different note, the first 32 positions of the upper half of the character
>set are supposed to be reserved for a bunch of new non-printing characters.
>Have the uses of these been standardized?  If so, it would certainly seem
>worth publicizing and discussing here.

I cannot find the names in my existing ISO documentation; could some
kind person please post a chart of these?

--dave (diluting my ignorance) collier-brown

ps: I have "ISO 2022 Information processing -- ISO 7-bit and 8-bit
  coded character sets -- Code extension techniques", Second edition - 
  1982-12-15, but id doesn't name the characters, just says they're there.
-- 
 David Collier-Brown.                 {mnetor|yetti|utgpu}!geac!daveb
 Geac Computers International Inc.,   |  Computer Science loses its
 350 Steelcase Road,Markham, Ontario, |  memory (if not its mind)
 CANADA, L3R 1B3 (416) 475-0525 x3279 |  every 6 months.

wcs@ho95e.ATT.COM (Bill.Stewart) (08/29/87)

In article <718@maccs.UUCP> gordan@maccs.UUCP (Gordan Palameta) writes:
:>In article <8410@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
:>) [if the French had invented computers, ........
:If the French had invented computers, there is little doubt that a character
:set to support French would have appeared sooner.  .....
:There is no chance that a 16-bit character set would have sprung up, fully
:formed -- computer memory used to be very, very expensive (weren't characters
:six bits, once upon a time?)

It would be variable length, from 5 to 14 bits ... but the last three
or four aren't pronounced :-).   More seriously, while a given
numeric character representation doesn't correspond identically to the
collating sequence (viz. English [Aa]<[Bb]... vs ASCII or EBCDIC),
one can build a table listing character representations in sequence,
and use it for sorting rather than building the sequence into the
language, as was suggested with #pragma language(Franglais).  While one
might use the representation directly when the collating sequence
doesn't really matter (e.g. building unique lists), one can still build
library functions to compare words.
-- 
#				Thanks;
# Bill Stewart, AT&T Bell Labs 2G218, Holmdel NJ 1-201-949-0705 ihnp4!ho95c!wcs

sommar@enea.UUCP (09/06/87)

It is now some weeks since I wrote an article in which I asked
for a change of paradigm for character representation. It was
followed by a discussion of sorting and of which languages use
which accents.
  In the last few days some new ideas have come up, and I'd
like to comment on these. In a separate article I am presenting my own
proposal for a character standard.

Basically we have seen two approaches to the problem: Alan
Lovejoy's idea of a character palette, and ISO 6937. The palette
first.
  The idea has its points, but I feel it is overworked. Do you
really need codes for all possible human sounds? Computers today
transmit written language, not spoken. But with speech synthesis
advancing, we may need such a standard in some decades.
  Alan hasn't spoken of sorting and character comparison. Just
comparing the (arbitrary) numeric codes doesn't seem meaningful.
Have you thought of introducing language dependency here?

So, ISO 6937. Until Bruce Sherwood wrote his article I hadn't
heard of this standard. Obviously it does a lot
of what I requested. By introducing mute modifiers a lot more
letters can be handled.
  But apparently it was ahead of its time. Fridrik Skulason
writes: "6937 may be better than 8859 for some purposes
(communication, that is), but as a standard character set for
terminals it is useless. The reason ... Simple. Most existing
software packages assume that (1 char in text = 1 char on screen)."
  That is giving up, I'd say. Yes, ISO 6937 would require many
existing programmes to be rewritten. It would also require
progress in hardware to handle the mute characters properly.
But can't we do this? If Fridrik is right we are doomed to live
with the ASCII/8859 stone-age approach forever.
  Now I think he is wrong. But of course you have to work harder
for a "progressive" standard to gain popularity than for a defensive
one like 8859. What you need is some leading manufacturer to
start using it, or an important customer to require it.
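
As a rough sketch of what handling the mute modifiers amounts to in the
"1 char in text = 1 char on screen" software Fridrik mentions (the
0xC0-0xCF range used below for the non-spacing accents is only modelled on
how I recall ISO 6937 laying them out, and the accent code is made up):

    #include <stdio.h>

    /* Hypothetical helper: count screen columns for a string in which a
     * byte in the 0xC0-0xCF range is a mute (non-spacing) accent that
     * combines with the letter that follows it. */
    static int display_width(const unsigned char *s)
    {
        int cols = 0;
        while (*s) {
            if (*s >= 0xC0 && *s <= 0xCF && s[1] != '\0')
                s++;              /* the accent takes no column of its own */
            cols++;
            s++;
        }
        return cols;
    }

    int main(void)
    {
        /* "cafe" with a (hypothetical) non-spacing acute accent, 0xC2,
         * preceding the final e: five bytes, four screen columns. */
        unsigned char cafe[] = { 'c', 'a', 'f', 0xC2, 'e', 0 };
        printf("bytes: 5, columns: %d\n", display_width(cafe));
        return 0;
    }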
-- 

Erland Sommarskog       
ENEA Data, Stockholm    
sommar@enea.UUCP