[alt.folklore.computers] Int'l Character set

dankg@lightning.Berkeley.EDU (Dan KoGai) (06/01/90)

In article <3410@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
>
>However, if you go for the more state-of-the-art ISO 8859 character
>sets, you get to use the 8th bit; all the 8859 character sets are ASCII
>in the first 128 positions (8th bit zero), and have additional
>characters including accented letters, etc. in the next 128 positions. 
>ISO 8859/1, the Western Europe and (North?) American (in the sense of
>the American continents, not the US) character set, has both "$" in the
>usual ASCII position, as well as "pound sterling".

	Is that identical to Mac|Next Character sets?  Well, I doubt it
but I think that if each diacriticized letter is a distinct character (as on
the Macintosh), the total number of characters will well exceed 0x100.  I
think it implements diacritics as two characters, and it's up to the
terminal|screen driver to resolve printing.
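
The two approaches mentioned here (one code per accented letter versus a base
letter plus a separate diacritic) can be sketched in a few lines of Python.
This is only an illustration under modern assumptions: the "latin-1" codec
name and the unicodedata module are Python's, not anything from 1990, and the
variable names are made up for the example.

```python
import unicodedata

# ISO 8859/1 is exposed in Python as the "latin-1" codec (assumption: a modern
# stand-in for the standard being discussed).
low_half = bytes(range(128))
ascii_matches = low_half.decode("latin-1") == low_half.decode("ascii")

# "$" keeps its ASCII position; pound sterling sits in the upper 128 codes.
dollar = "$".encode("latin-1")
pound = "\u00a3".encode("latin-1")

# Precomposed: an accented letter is a single character, hence a single byte.
precomposed = "\u00e9"                       # e with acute accent
one_byte = precomposed.encode("latin-1")

# Decomposed: the base letter followed by a separate combining accent.
decomposed = unicodedata.normalize("NFD", precomposed)

print(ascii_matches, dollar, pound, one_byte, len(decomposed))
```

With the precomposed form every accented letter uses up one of the 128 upper
codes; with the decomposed form the display side has to combine the pieces.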

>(There's also ISO 10646, which is a *big* character set under
>development that will supposedly give you all the characters in the
>world, or at least a big subset including Japanese & Chinese and the
>like....)

	But HOW?  Chinese alone has 100,000 or more characters and that's
more than 0x10000!  Well, I think some rarely used characters will be
omitted, as in the Japanese JIS character set, which omits a lot of old and
unused characters.  But some people do need them.  (Japan has a strange law
on birth records:  even if your parents misspelled your name, it has to be
registered AS IT IS!  This is a pain because it's not just a matter of the
string but of the character itself.)  So a user character editor is an almost
standard feature; it uses the unused code positions.  (The JIS standard can
store up to (# of !isctrl)^2 characters.  The upper bit and control chars are
avoided for ASCII compatibility, and there are some gaps, like EBCDIC.)
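
As a rough check on the "(# of !isctrl)^2" figure, here is a small Python
sketch.  It assumes the count means the 7-bit codes 0x21 through 0x7E (no
control characters, no space, no DEL), and it uses Python's iso2022_jp codec
as a stand-in for JIS encoding; both are assumptions made for the example.

```python
# 7-bit codes that are neither control characters, space, nor DEL.
# Assumption: this is the set the "(# of !isctrl)" count refers to.
graphic_codes = [c for c in range(0x80) if 0x21 <= c <= 0x7E]
per_byte = len(graphic_codes)
grid_size = per_byte ** 2            # number of two-byte code positions

# Encode one kanji and strip the 3-byte escape sequences that switch into
# and out of the two-byte set (ESC $ B ... ESC ( B).
encoded = "\u65e5".encode("iso2022_jp")
payload = encoded[3:-3]

# Both payload bytes stay inside the printable 7-bit range.
stays_printable = all(0x21 <= b <= 0x7E for b in payload)

print(per_byte, grid_size, len(payload), stays_printable)
```

The grid is far smaller than 0x10000, which is why rare characters get left
out and the unassigned positions are attractive for user-defined characters.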
	I think it's wiser to set local standards and standardize a "language
code" to switch character sets.  That's how the Mac implements international
character sets in the Script Manager:  all you need is the correct fonts.

----------------
____  __  __    + Dan The "^[$B^[(J" Man
    ||__||__|   + E-mail:	dankg@ocf.berkeley.edu
____| ______ 	+ Voice:	+1 415-549-6111
|     |__|__|	+ USnail:	1730 Laloma Berkeley, CA 94709 U.S.A
|___  |__|__|	+	
    |____|____	+ "What's the biggest U.S. export to Japan?" 	
  \_|    |      + "Bullshit.  It makes the best fertilizer for their rice"

guy@auspex.auspex.com (Guy Harris) (06/01/90)

>	Is that identical to Mac|Next Character sets?  Well, I doubt it

I do as well.  I seem to remember comparing the Mac character set with
8859/1 and seeing that they weren't the same.  (I don't think 8859/1 is
the same as the PC character set, either.  So it goes....)
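
A quick way to see the mismatch, sketched in Python.  The codec names are
Python's, and treating "cp437" as "the PC character set" is an assumption on
my part; codes_for is a helper invented for this example.

```python
# latin-1 = ISO 8859/1, mac_roman = classic Macintosh set,
# cp437 = original IBM PC code page (assumed to be the "PC character set").
CODECS = ("latin-1", "mac_roman", "cp437")

def codes_for(ch):
    """Return {codec name: byte value} for a single character."""
    return {name: ch.encode(name)[0] for name in CODECS}

e_acute = codes_for("\u00e9")    # e with acute accent
dollar = codes_for("$")          # plain ASCII character

for name in CODECS:
    print(f"{name:10} e-acute=0x{e_acute[name]:02X} dollar=0x{dollar[name]:02X}")
```

The ASCII character gets the same code in all three, while the accented
letter gets a different code in each.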

>	But HOW?  Chinese alone has 100,000 or more characters and that's
>more than 0x10000!

So?  Who said 10646 fit in 16 bits?  Here's some stuff from a posting by
Dominic Dunlop:

	 5. SC2's answer to life, the universe and everything is DP
	    (draft proposal) 10646, which defines a 32-bit wide
	    character set with 8- and 16-bit wide canonical versions
	    for storage and transmission, and a 24-bit wide
	    processing version for those who can get by with only
	    eight million characters or so.

>	I think it's wiser to set local standards and standardize a "language
>code" to switch character sets.  That's how the Mac implements international
>character sets in the Script Manager:  all you need is the correct fonts.

As long as you can switch character sets within a document....

More from Dominic, in response to some questions (">" are my questions):

  > Are "8- and 16-bit wide canonical versions" capable of representing all
  > 2^24 or 2^32 characters in the set?

  Yes.

  > If so, do they use some run-length
  > encoding scheme on the upper bits, Xerox-style (or embedded announcement
  > escape sequences, which amount to much the same thing)?

  Embedded escapes.  Can't seek on the canonicalised streams.

  > If so, does
  > this mean that an ASCII-only file can be thought of as being a file in
  > this character set?

  Yes.  Not clear whether a little pre-announcement might be required.
  (Strictly, I'm talking about an ISO 646 file, rather than ASCII.)
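
The thread does not give the actual escape syntax of the DP 10646 canonical
forms, so the following Python sketch uses ISO 2022 style escapes (via
Python's iso2022_jp codec) purely as an analogy.  It shows the two properties
mentioned above: an ASCII-only file is unchanged, and a stream with embedded
escapes cannot be decoded correctly from an arbitrary offset.

```python
text = "abc \u65e5\u672c xyz"          # ASCII, two kanji, ASCII
stream = text.encode("iso2022_jp")

# ASCII-only text comes out byte-for-byte identical to plain ASCII.
ascii_only = "plain text".encode("iso2022_jp") == b"plain text"

# The kanji are bracketed by escape sequences that switch character sets.
start = stream.index(b"\x1b$B")        # switch to the two-byte set
end = stream.index(b"\x1b(B")          # switch back to ASCII

# "Seek" past the opening escape and decode from there.  The decoder never
# sees the switch, so it reads the kanji bytes as ordinary ASCII characters.
tail = stream[start + 3:]
misread = tail.decode("iso2022_jp")

print(ascii_only, start < end, repr(misread))
```

Decoding the whole stream from the beginning recovers the text; decoding from
the middle silently produces the wrong characters.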

ljdickey@water.waterloo.edu (L.J.Dickey) (06/02/90)

In article <1990Jun1.010720.16597@agate.berkeley.edu> dankg@lightning.Berkeley.EDU (Dan KoGai) writes:
>In article <3410@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
>
> [ lots left out ]
>
>>(There's also ISO 10646, which is a *big* character set under
>>development that will supposedly give you all the characters in the
>>world, or at least a big subset including Japanese & Chinese and the
>>like....)
>
>	But HOW?  Chinese alone has 100,000 or more characters and that's
>more than 0x10000!

Well, *big* means something more than 100,000!

I think that ISO 10646 allows something on the order of
4*10^9 characters.  (Think of four byte addressing.)

Unless there is some alphabet I don't know about, that number is
big enough to index every human alphabet on earth and then some.
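
The arithmetic, as a small Python sketch.  The 100,000 figure is the estimate
quoted in this thread, not a measured count, and the variable names are made
up for the example.

```python
chinese_estimate = 100_000       # figure quoted in the thread
sixteen_bit = 2 ** 16            # 0x10000 codes
thirty_two_bit = 2 ** 32         # "four byte addressing"

fits_16 = chinese_estimate <= sixteen_bit
fits_32 = chinese_estimate <= thirty_two_bit

print(hex(sixteen_bit), fits_16)
print(thirty_two_bit, fits_32)
```

So the estimate overflows 16 bits but takes up only a tiny fraction of a
32-bit code space.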