dankg@lightning.Berkeley.EDU (Dan KoGai) (06/01/90)
In article <3410@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
>
>However, if you go for the more state-of-the-art ISO 8859 character
>sets, you get to use the 8th bit; all the 8859 character sets are ASCII
>in the first 128 positions (8th bit zero), and have additional
>characters including accented letters, etc. in the next 128 positions.
>ISO 8859/1, the Western Europe and (North?) American (in the sense of
>the American continents, not the US) character set, has both "$" in the
>usual ASCII position, as well as "pound sterling".

	Is that identical to Mac|Next Character sets? Well, I doubt it,
but I think that if each diacriticized letter is a distinct character
(as on the Macintosh), the total number of characters will well exceed
0x100. I think it implements diacritics as two characters and leaves it
up to the terminal|screen driver to resolve printing.

>(There's also ISO 10646, which is a *big* character set under
>development that will supposedly give you all the characters in the
>world, or at least a big subset including Japanese & Chinese and the
>like....)

	But HOW? Chinese alone has 100,000 or more characters and that's
more than 0x10000! Well, I think some rarely used characters will be
omitted, as in the Japanese JIS character set, which omits a lot of old
and unused characters. But some people do need them (Japan has a strange
law on birth records: even if your parents misspelled your name, it has
to be registered AS IT IS! This is a pain because it's not just a matter
of the string but of the character itself), so a user character editor
is an almost standard feature--it uses the unused chars. (The JIS std is
capable of storing up to (# of !isctrl)^2 characters. The upper bit and
cntl chars are avoided for ASCII compatibility, and there are some gaps
like EBCDIC.)
	I think it's wiser to set a local standard and standardize a
"language code" to switch character sets. That's how the Mac implements
international character sets in the Script Manager: all you need is the
correct fonts.
----------------
____ __ __    + Dan The "^[$B^[(J" Man
||__||__|     + E-mail: dankg@ocf.berkeley.edu
____| ______  + Voice: +1 415-549-6111
| |__|__|     + USnail: 1730 Laloma Berkeley, CA 94709 U.S.A
|___ |__|__|  +
|____|____    + "What's the biggest U.S. export to Japan?"
\_| |         + "Bullshit. It makes the best fertilizer for their rice"
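
The question of whether ISO 8859/1 matches the Mac character set can be checked mechanically. The following is only a sketch: it assumes that Python's `latin_1`, `mac_roman`, and `cp437` codecs are fair stand-ins for ISO 8859/1, the Macintosh set, and the IBM PC set discussed in the thread. It does not cover the NeXT set, and the helper name `table` is made up for illustration.

```python
# Decode every byte value under three 8-bit code pages and compare them.
ALL_BYTES = bytes(range(256))


def table(codec):
    """Return the 256-character decoding table for a codec (hypothetical helper)."""
    # errors="replace" guards against any byte value a codec leaves undefined.
    return ALL_BYTES.decode(codec, errors="replace")


latin1 = table("latin_1")    # stand-in for ISO 8859/1
mac = table("mac_roman")     # stand-in for the Macintosh character set
pc = table("cp437")          # stand-in for the IBM PC character set

# The lower 128 positions of ISO 8859/1 are plain ASCII.
ascii_half = bytes(range(128)).decode("ascii")
print("8859/1 lower half is ASCII:", latin1[:128] == ascii_half)

# "$" stays in its ASCII position; "pound sterling" lives in the upper half.
print("$ ->", "$".encode("latin_1"), " £ ->", "£".encode("latin_1"))

# Count how many upper-half positions disagree with ISO 8859/1.
for name, other in (("mac_roman", mac), ("cp437", pc)):
    differing = sum(a != b for a, b in zip(latin1[128:], other[128:]))
    print(name, "differs from 8859/1 in", differing, "of 128 upper positions")

# One accented letter, three different byte values.
for codec in ("latin_1", "mac_roman", "cp437"):
    print(codec, "encodes é as", "é".encode(codec))
```

All three tables agree on ASCII, so the disagreement is confined to the upper 128 positions, where the accented letters sit at different byte values in each set.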
guy@auspex.auspex.com (Guy Harris) (06/01/90)
> Is that identical to Mac|Next Character sets? Well, I doubt it

I do as well. I seem to remember comparing the Mac character set with
8859/1 and seeing that they weren't the same. (I don't think 8859/1 is
the same as the PC character set, either. So it goes....)

> But HOW? Chinese alone has 100,000 or more characters and that's
>more than 0x10000!

So? Who said 10646 fit in 16 bits? Here's some stuff from a posting by
Dominic Dunlop:

    5.  SC2's answer to life, the universe and everything is DP (draft
        proposal) 10646, which defines a 32-bit wide character set with
        8- and 16-bit wide canonical versions for storage and
        transmission, and a 24-bit wide processing version for those
        who can get by with only eight million characters or so.

> I think it's wiser to set a local standard and standardize a "language
>code" to switch character sets. That's how the Mac implements international
>character sets in the Script Manager: all you need is the correct fonts.

As long as you can switch character sets within a document....

More from Dominic, in response to some questions (">" are my questions):

> Are "8- and 16-bit wide canonical versions" capable of representing all
> 2^24 or 2^32 characters in the set?

Yes.

> If so, do they use some run-length
> encoding scheme on the upper bits, Xerox-style (or embedded announcement
> escape sequences, which amount to much the same thing)?

Embedded escapes. Can't seek on the canonicalised streams.

> If so, does
> this mean that an ASCII-only file can be thought of as being a file in
> this character set?

Yes. Not clear whether a little pre-announcement might be required.
(Strictly, I'm talking about an ISO 646 file, rather than ASCII.)
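
The thread does not describe the actual byte layout of the DP 10646 canonical forms, so the sketch below uses ISO-2022-JP purely as an analogy. It is an existing 7-bit encoding that switches character sets with embedded escape sequences, available in Python as the `iso2022_jp` codec. It illustrates the two properties Dominic mentions: an ASCII-only file passes through unchanged, and you cannot seek into the middle of the stream without losing the shift state.

```python
# Analogy only: ISO-2022-JP, not the DP 10646 canonical form.
ESC = b"\x1b"
TO_KANJI = ESC + b"$B"   # announces the two-byte JIS set

plain = "hello, world"
mixed = "abc 日本 xyz"

plain_bytes = plain.encode("iso2022_jp")
mixed_bytes = mixed.encode("iso2022_jp")

# An ASCII-only text is byte-for-byte the same as its ASCII encoding.
print("ASCII passes through:", plain_bytes == plain.encode("ascii"))

# Mixed text carries escape sequences that switch sets mid-stream.
print("Encoded stream:", mixed_bytes)

# Read sequentially from the start, the stream decodes correctly.
print("Sequential decode:", mixed_bytes.decode("iso2022_jp"))

# Seek past the escape and decode two bytes with no shift state:
# they are misread as two ASCII characters instead of one kanji.
start = mixed_bytes.index(TO_KANJI) + len(TO_KANJI)
fragment = mixed_bytes[start:start + 2]
print("After seeking:", repr(fragment.decode("iso2022_jp")))
```

The last step is the "can't seek" problem in miniature: the meaning of a byte depends on the most recent escape sequence, which may lie arbitrarily far back in the stream.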
ljdickey@water.waterloo.edu (L.J.Dickey) (06/02/90)
In article <1990Jun1.010720.16597@agate.berkeley.edu> dankg@lightning.Berkeley.EDU (Dan KoGai) writes:
>In article <3410@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
>
> [ lots left out ]
>
>>(There's also ISO 10646, which is a *big* character set under
>>development that will supposedly give you all the characters in the
>>world, or at least a big subset including Japanese & Chinese and the
>>like....)
>
>	But HOW? Chinese alone has 100,000 or more characters and that's
>more than 0x10000!

Well, *big* means something more than 100,000! I think that ISO 10646
allows something on the order of 4*10^9 characters. (Think of four byte
addressing.) Unless there is some alphabet I don't know about, that
number is big enough to index every human alphabet on earth and then
some.
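
The sizes thrown around in this thread are easy to check. The sketch below uses the 100,000 figure quoted above as given, without verifying it. It also assumes the "(# of !isctrl)^2" remark earlier in the thread refers to the 94 graphic positions 0x21 to 0x7E, i.e. with space and DEL excluded, as in a 94x94 JIS plane.

```python
# Code-space arithmetic for the figures quoted in the thread.
CHINESE_ESTIMATE = 100_000      # figure quoted in the thread, taken as given
SIXTEEN_BIT = 0x10000           # code points addressable with two bytes
THIRTY_TWO_BIT = 2 ** 32        # code points addressable with four bytes

# Assumption: graphic characters are 0x21..0x7E (no space, no DEL).
GRAPHIC = range(0x21, 0x7F)
JIS_PLANE = len(GRAPHIC) ** 2   # two-byte codes built from graphic bytes only

print("16-bit space:     ", SIXTEEN_BIT)
print("fits the estimate:", SIXTEEN_BIT >= CHINESE_ESTIMATE)
print("32-bit space:     ", THIRTY_TWO_BIT)
print("spare after est.: ", THIRTY_TWO_BIT - CHINESE_ESTIMATE)
print("graphic bytes:    ", len(GRAPHIC))
print("94x94 plane:      ", JIS_PLANE)
```

Two bytes fall short of the quoted estimate (65,536 against 100,000), while four bytes give 4,294,967,296 positions, which is the "order of 4*10^9" mentioned above.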