kupfer@ucbvax.ARPA (Mike Kupfer) (10/09/85)
I think that 8 bits is still not enough if you want to include oriental or other non-Roman character sets. So using only 8 bits is reasonable if you assume that a typical UNIX system will not be able to display these characters (so why bother with them), but you should realize that this assumption is being made.

--
Mike Kupfer
Xerox Corporation - SDD
kupfer.pa@xerox.ARPA
...!ucbvax!kupfer
david@ukma.UUCP (David Herron, NPR Lover) (10/10/85)
In article <10597@ucbvax.ARPA> kupfer@ucbvax.UUCP (Mike Kupfer) writes:
>I think that 8 bits is still not enough if you want to include oriental
>or other non-Roman character sets. So using only 8 bits is reasonable
>if you assume that a typical UNIX system will not be able to display
>these characters (so why bother with them), but you should realize that
>this assumption is being made.

There's some work being done at Xerox and elsewhere on representing foreign character sets and word-processing them. Look in Scientific American, in an issue from a year or two ago; I think that may even have been the topic for that month.

The method described in one article (as I recall) was to define one code as an "escape" code. You could follow the escape code with commands to switch character sets or whatever. So instead of an absolute encoding, you had a context-sensitive encoding, which gives you greater flexibility in the character sets you are storing. (They are aiming for a system whereby ALL text, regardless of language, may be word-processed, etc.)

One of the most interesting things I remember is that some languages have characters which *surround* other characters. This was making for an interesting typesetting problem.

--
David Herron, ukma!david@ANL-MCS.ARPA, cbosgd!ukma!david
(Soon -- david@UKMA.BITNET, and (hopefully) david@ukma.csnet)
Hackin's in me blood! My mother was known as Miss Hacker before she married!
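A minimal sketch of the escape-code idea described in the post above, written in Python. It is not the scheme from the article: the escape value `ESC`, the `CHARSETS` tables, and the `decode` function are all hypothetical, chosen only to show how the same byte can mean different characters depending on which set was selected last.

```python
# Hypothetical escape byte; any value reserved in every character set would do.
ESC = 0xFF

# Hypothetical character-set tables (assumptions for illustration only).
CHARSETS = {
    # Set 0: printable ASCII.
    0: {i: chr(i) for i in range(0x20, 0x7F)},
    # Set 1: Greek capitals Alpha..Rho mapped onto the bytes for 'A'..'Q'.
    1: {0x41 + i: chr(0x0391 + i) for i in range(17)},
}


def decode(data: bytes) -> str:
    """Decode a byte stream in which ESC followed by a set number
    switches the character set used for all following bytes."""
    current = 0  # start in set 0
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == ESC:
            # The byte after ESC is a command: select that character set.
            if i + 1 >= len(data):
                raise ValueError("escape code at end of input")
            current = data[i + 1]
            if current not in CHARSETS:
                raise ValueError(f"unknown character set {current}")
            i += 2
            continue
        table = CHARSETS[current]
        if b not in table:
            raise ValueError(f"byte {b:#04x} undefined in set {current}")
        out.append(table[b])
        i += 1
    return "".join(out)


if __name__ == "__main__":
    stream = bytes([0x41, 0x42, ESC, 1, 0x41, 0x42, ESC, 0, 0x41])
    print(decode(stream))
```

The cost of this flexibility is that a byte cannot be interpreted in isolation: a reader that starts in the middle of the stream does not know which set is active.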
michaelm@bcsaic.UUCP (michael b maxwell) (10/10/85)
In article <10597@ucbvax.ARPA> kupfer@ucbvax.UUCP (Mike Kupfer) writes:
>I think that 8 bits is still not enough if you want to include oriental
>or other non-Roman character sets. So using only 8 bits is reasonable
>if you assume that a typical UNIX system will not be able to display
>these characters...

Along these lines, readers of this newsgroup may be interested in the following article:

    Anderson, Lloyd B. 1984. "Multilingual Text Processing in a Two-Byte Code." 10th. Int'l. Conf. on Computational Linguistics, pg. 1-4.

Part of the abstract:

    ...standards committees are now discussing a two-byte code for multilingual information processing... 65,536 separate character and control codes, enough to make permanent code assignments for all national alphabets of the world, and also to include Chinese/Japanese characters... It is possible to arrange alphabet codes to provide transliteration equivalence...

He discusses the problems of diacritics, digraphs, alphabetization, etc. The committee referred to is apparently the "ANSI X3L2" committee (at least it's the only committee I can find reference to in the text).

--
Mike Maxwell
Boeing Artificial Intelligence Center
..uw-beaver!{uw-june,ssc-vax}!bcsaic!michaelm
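The 65,536 figure in the abstract is just the number of distinct values two 8-bit bytes can hold (2^16). A short Python sketch of a fixed-width two-byte code follows. The function names and the sample code values are hypothetical placeholders, not assignments from Anderson's paper or from any committee, and the big-endian byte order is likewise an assumption.

```python
def encode_two_byte(codes):
    """Pack each code (0..65535) into exactly two bytes, high byte first."""
    # int.to_bytes raises OverflowError if a code does not fit in two bytes.
    return b"".join(code.to_bytes(2, "big") for code in codes)


def decode_two_byte(data):
    """Unpack a byte string into a list of two-byte codes."""
    if len(data) % 2 != 0:
        raise ValueError("two-byte text must have an even number of bytes")
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]


if __name__ == "__main__":
    print(2 ** 8, 2 ** 16)          # codes available in one byte vs. two
    text = [65, 20013, 65535]       # arbitrary placeholder code values
    packed = encode_two_byte(text)
    print(packed.hex(), decode_two_byte(packed))
```

Unlike an escape-code scheme, every character here occupies the same two bytes regardless of context, so the n-th character always starts at byte offset 2n; the price is that plain 7-bit text doubles in size.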