[comp.std.misc] Int'l Character set

dankg@lightning.Berkeley.EDU (Dan KoGai) (06/01/90)

In article <3410@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
>
>However, if you go for the more state-of-the-art ISO 8859 character
>sets, you get to use the 8th bit; all the 8859 character sets are ASCII
>in the first 128 positions (8th bit zero), and have additional
>characters including accented letters, etc. in the next 128 positions. 
>ISO 8859/1, the Western Europe and (North?) American (in the sense of
>the American continents, not the US) character set, has both "$" in the
>usual ASCII position, as well as "pound sterling".

	Is that identical to Mac|Next Character sets?  Well, I doubt it
but I think total number of characters, if each of diacriticized letter
is distinctive character (in case of Macintosh), total number will well
exceed 0x100.  I think it implements diacritics as two characters and it's
up to terminal|screen driver to resolve printing.

>(There's also ISO 10646, which is a *big* character set under
>development that will supposedly give you all the characters in the
>world, or at least a big subset including Japanese & Chinese and the
>like....)

	But HOW?  Chinese alone has 100,000 or more characters and that's
more than 0x10000!  Well, I think some of rarely used characters will be
omitted like Japanese JIS character set, which omits a lot of old and unused
characters.  But some people do need them (Japan has a strange law of birth
record:  Even though your parents misspelled your name, they have to register
AS IT IS!  This is pain because it's not just a matter of string but character
itself) so user character editor is almost standard feature--it uses unused
chars (JIS std is capable of storing up to (# of !isctrl)^2 char sets.  Upper 
bit and cntl chars are avoided for ascii compatibility and there are some gaps
like EBCDIC).
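
The "(# of !isctrl)^2" figure can be checked with a little arithmetic. What follows is only a sketch, not anything from the JIS documents: it assumes each of the two bytes is drawn from the 94 printable 7-bit codes 0x21 through 0x7E (space and DEL excluded), and the helper name `kuten_to_jis` is made up for illustration.

```python
# Printable 7-bit codes, excluding space (0x20) and DEL (0x7F): 0x21..0x7E.
FIRST, LAST = 0x21, 0x7E
CODES_PER_BYTE = LAST - FIRST + 1          # 94
TWO_BYTE_POSITIONS = CODES_PER_BYTE ** 2   # 94 * 94 = 8836


def kuten_to_jis(row: int, cell: int) -> bytes:
    """Map a (row, cell) pair, each in 1..94, to two 7-bit bytes.

    Hypothetical helper: it just adds 0x20 to each coordinate, which keeps
    both bytes clear of the control codes and of the 8th bit.
    """
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    return bytes([0x20 + row, 0x20 + cell])


print(CODES_PER_BYTE, TWO_BYTE_POSITIONS)   # 94 8836
print(kuten_to_jis(4, 2).hex())             # 2422
```

So a two-byte scheme of this shape has room for fewer than nine thousand characters, which is why rare characters are left out and user-defined characters end up in the unused positions.
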
	I think it's wiser to set local standard and standardize "language
code" to switch character sets.  That's how Mac implements international
character sets in Script Manager:  All you need is correct fonts.

----------------
____  __  __    + Dan The "^[$B^[(J" Man
    ||__||__|   + E-mail:	dankg@ocf.berkeley.edu
____| ______ 	+ Voice:	+1 415-549-6111
|     |__|__|	+ USnail:	1730 Laloma Berkeley, CA 94709 U.S.A
|___  |__|__|	+	
    |____|____	+ "What's the biggest U.S. export to Japan?" 	
  \_|    |      + "Bullshit.  It makes the best fertilizer for their rice"

guy@auspex.auspex.com (Guy Harris) (06/01/90)

>	Is that identical to Mac|Next Character sets?  Well, I doubt it

I do as well.  I seem to remember comparing the Mac character set with
8859/1 and seeing that they weren't the same.  (I don't think 8859/1 is
the same as the PC character set, either.  So it goes....)

>	But HOW?  Chinese alone has 100,000 or more characters and that's
>more than 0x10000!

So?  Who said 10646 fit in 16 bits?  Here's some stuff from a posting by
Dominic Dunlop:

	 5. SC2's answer to life, the universe and everything is DP
	    (draft proposal) 10646, which defines a 32-bit wide
	    character set with 8- and 16-bit wide canonical versions
	    for storage and transmission, and a 24-bit wide
	    processing version for those who can get by with only
	    eight million characters or so.

>	I think it's wiser to set local standard and standardize "language
>code" to switch character sets.  That's how Mac implements international
>character sets in Script Manager:  All you need is correct fonts.

As long as you can switch character sets within a document....

More from Dominic, in response to some questions (">" are my questions):

  > Are "8- and 16-bit wide canonical versions" capable of representing all
  > 2^24 or 2^32 characters in the set?

  Yes.

  > If so, do they use some run-length
  > encoding scheme on the upper bits, Xerox-style (or embedded announcement
  > escape sequences, which amount to much the same thing)?

  Embedded escapes.  Can't seek on the canonicalised streams.

  > If so, does
  > this mean that an ASCII-only file can be thought of as being a file in
  > this character set?

  Yes.  Not clear whether a little pre-announcement might be required.
  (Strictly, I'm talking about an ISO 646 file, rather than ASCII.)

ljdickey@water.waterloo.edu (L.J.Dickey) (06/02/90)

In article <1990Jun1.010720.16597@agate.berkeley.edu> dankg@lightning.Berkeley.EDU (Dan KoGai) writes:
>In article <3410@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
>
> [ lots left out ]
>
>>(There's also ISO 10646, which is a *big* character set under
>>development that will supposedly give you all the characters in the
>>world, or at least a big subset including Japanese & Chinese and the
>>like....)
>
>	But HOW?  Chinese alone has 100,000 or more characters and that's
>more than 0x10000!

Well,   *big*   means something more than 100,000 !

I think that ISO 10646 allows something on the order of
4*10^9 characters.  (Think of four byte addressing.)

Unless there is some alphabet I don't know about, that number is
big enough to index every human alphabet on earth and then some.
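
To put numbers on the sizes being thrown around in this thread, here is a small sketch. The 100,000 figure is simply the estimate quoted above, not a measured count, and `capacity` is a name chosen for this example.

```python
CHINESE_ESTIMATE = 100_000  # figure quoted in the thread, not a measured count


def capacity(bits: int) -> int:
    """Number of distinct codes available in a field of the given width."""
    return 2 ** bits


for bits in (16, 24, 32):
    n = capacity(bits)
    print(f"{bits:2d} bits: {n:>13,d} codes, "
          f"room for the estimate: {n >= CHINESE_ESTIMATE}")
```

Sixteen bits give 65,536 codes, which falls short of the estimate; 24 and 32 bits both clear it with a great deal of room to spare.
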

domo@tsa.co.uk (Dominic Dunlop) (06/05/90)

In article <1990Jun1.010720.16597@agate.berkeley.edu>
dankg@lightning.Berkeley.EDU (Dan KoGai) follows up
<3410@auspex.auspex.com> from guy@auspex.auspex.com (Guy Harris):
>	Is [8859-1] identical to Mac|Next Character sets?  Well, I doubt it
>but I think total number of characters, if each of diacriticized letter
>is distinctive character (in case of Macintosh), total number will well
>exceed 0x100.  I think it implements diacritics as two characters and it's
>up to terminal|screen driver to resolve printing.

The Macintosh character set differs from that specified in international
standard 8859-1 for characters with codes > 128.  Too bad, if not
unexpected: the Mac's internationalization features have been ahead of the
field since day one, but Apple does not, to my knowledge, participate in
or contribute to the standards process at any level.  (8859-1 was published
in 1987.)

Apple's 2.0 release of its A/UX variant of UNIX is billed as having
internationalization features.  Sadly, these are based on Apple's
character set.  Confusingly, A/UX 2.0 also provides good support for X
Window, which provides explicit support for 8859-1 through 4.  It seems
one has a choice of incompatible options, even within the same system.
Plus ca change...  Or, more correctly, plus \215a change (Apple) / plus
\347a change (8859-1).
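
The two octal codes for the c-cedilla can be reproduced with the codec tables that ship with Python. This is only an illustration of the mismatch; it assumes Python's `mac_roman` and `iso8859_1` codecs are acceptable stand-ins for the Apple character set and for 8859-1.

```python
c_cedilla = "\u00e7"  # LATIN SMALL LETTER C WITH CEDILLA

mac = c_cedilla.encode("mac_roman")
latin1 = c_cedilla.encode("iso8859_1")

print(oct(mac[0]), oct(latin1[0]))   # 0o215 0o347

# The same byte value does not mean the same letter in the other set.
mac_byte_read_as_latin1 = mac.decode("iso8859_1")
latin1_byte_read_as_mac = latin1.decode("mac_roman")
print(mac_byte_read_as_latin1 == c_cedilla)   # False
print(latin1_byte_read_as_mac == c_cedilla)   # False
```

A file written under one convention and displayed under the other therefore shows the wrong character (or a control code) wherever an accented letter was intended.
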

Thanks to Guy for dredging up that stuff from me on the forthcoming 10646
standard.  Saves me having to do it!
-- 
Dominic Dunlop

minow@mountn.dec.com (Martin Minow) (06/05/90)

In article  <1990Jun1.010720.16597@agate.berkeley.edu> 
dankg@lightning.Berkeley.EDU (Dan KoGai) asks about the internationalization
features of ISO 10646.

As I recall (I've only seen an early draft), 10646 is a registry of
characters.  I.e., it describes the way a conforming device converts a
character into a shape.  In general, each script is assigned a range
of values, and characters are assigned within that range.

As Dan points out, this is not sufficient, and, as I recall, other standards
specify writing direction.  There are other problems as well.  For example,
does the code for ASCII "(" represent "left parenthesis" or "open parenthesis"?

While the Macintosh is evolving towards a multi-lingual environment, they
do not seem to be approaching the problem in the same way as the
ISO Standard body which, in the long term, may be a problem for all of us.

Martin Minow
minow@thundr.enet.dec.com
The above does not represent the position of Digital Equipment Corporation

aronsson@lysator.liu.se (Lars Aronsson) (06/12/90)

minow@mountn.dec.com (Martin Minow) writes:

>While the Macintosh is evolving towards a multi-lingual environment, they
>do not seem to be approaching the problem in the same way as the
>ISO Standard body which, in the long term, may be a problem for all of us.
                   ^^^^^

Of course, the word "which" in the above quote alludes to the fact
that Mac (Apple Inc.) do not follow ISO. On the other hand, the (mis-)
interpretation that "which" alludes to the ISO standard body, gives
the quote an entirely different meaning.

I live in Sweden, where the national alphabet contains A-Z plus three
"umlaut" letters. The Swedish version of the old 7-bit ISO-646 (called
Swedish ASCII) replaces the "[", "\", and "]" characters of ASCII
with the national letters, as do many versions of ISO-646 in other
European countries. I think the audience can imagine how Pascal and C
program listings look when the printer uses Swedish ASCII. Most
terminals made here have a SET-UP switch for selecting Swedish or US
ASCII.
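
For those who have not seen it, the effect on a program listing can be simulated. The sketch below assumes the usual Swedish substitutions: "[", "\", "]" become the capital national letters, and "{", "|", "}" become the lowercase ones. The lowercase half is my addition and is not stated in the post above; the table name is made up.

```python
# Assumed Swedish 7-bit substitutions (ASCII position -> national letter).
SWEDISH_646 = str.maketrans(
    "[\\]{|}",
    "\u00c4\u00d6\u00c5\u00e4\u00f6\u00e5",   # Ä Ö Å ä ö å
)

c_source = r'if (a[i] || b[j]) { s = "x\n"; }'
listing = c_source.translate(SWEDISH_646)
print(listing)
```

Every bracket, brace, vertical bar and backslash in the C fragment comes out as a letter, which is exactly what a printer set to Swedish ASCII does to the listing.
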

When ISO published the 8859 standard in 1989 (or 1988?), many of us
thought this problem was solved once and for all. At Linkoping
University, SunOS 4.1 will soon be installed having 8 bit characters
and using ISO 8859. Northern Europe will use ISO 8859-1.

Unfortunately, the world is no longer the same. The Berlin Wall has
fallen and the Soviet Empire is tumbling. You cannot imagine how many
accent marks these Estonians, Latvians, Lithuanians, Poles, Czechs,
Slovaks, and Hungarians can invent! Not to mention the Cyrillic
alphabet. And in 1992, the European Common Market will be a fact and
we will need to communicate with people in Portugal, Spain, Italy and
Greece.

Now, if we are lucky, all these letters are in ISO 8859-1 thru -9 (are
there nine?). But it seems we are stuck with this switching between
character sets that we know so well from the 7-bit era. Does the ISO
8859 have codes that tell the equipment to set the right version or
will we still do this by hand? Or should I learn postscript?

Lars Aronsson
     Aronsson@Lysator.LiU.SE

minow@mountn.dec.com (Martin Minow) (06/12/90)

In article <71@lysator.liu.se> aronsson@lysator.liu.se (Lars Aronsson) writes:
>
>I live in Sweden, where the national alphabet contains A-Z plus three
>"umlaut" letters. The Swedish version of the old 7-bit ISO-646 (called
>Swedish ASCII) replaces the "[", "\", and "]" characters of ASCII
>with the national letters, as do many versions of ISO-646 in other
>European countries.  ...
>When ISO published the 8859 standard in 1989 (or 1988?), many of us
>thought this problem was solved once and for all. ...
> Northern Europe will use ISO 8859-1.

Well, not quite: for example, the Same (Lapp) languages require characters
that are not in ISO 8859-1.  This might also be the case for Irish and
Basque (I'm not completely certain.)

>
>Now, if we are lucky, all these letters are in ISO 8859-1 thru -9 (are
>there nine?). But it seems we are stuck with this switching between
>character sets that we know so well from the 7-bit era. Does the ISO
>8859 have codes that tell the equipment to set the right version or
>will we still do this by hand? Or should I learn postscript?

ISO 8859 (and allied standards) define escape sequences for switching between
character sets.  A recent DEC terminal programmer's guide (for, say, the
VT300 series) will describe them. As I recall, there are character sets
in the ISO 8859 family for the Slavic languages, including Cyrillic, so, if
your terminal images the character set, you should be able to switch between
character sets with reasonable ease.  The important point about ISO 8859
is that switching need not be done for text shared between two dozen
languages, and is handled in a more coherent manner for other languages.

The Macintosh takes a somewhat different approach: all text on the Macintosh
is tagged with a font name (either implicitly or explicitly).  The font
definition contains imaging information: bitmaps for the screen and
Postscript for the printer (I'm simplifying somewhat).  There need be no
correlation between the character code and any particular ISO/ASCII character.
For example, I developed a font for an internationally standardized symbol
font (for Orienteering) that groups symbols in a system completely different
from "ASCII."

Postscript, for that matter, uses an internal database to associate a
character code with a character name, and the name is associated with
the Postscript program that draws that shape (again, simplifying).  Thus,
if you can send Postscript directly to a printer, you can send "lower-case-
swedish-a-with-ring" to image that specific shape.

In the long-term, systems will migrate to a 32-bit character set (under
development as ISO 10646).  Here, too, escape sequences will be needed
to access subsets of the larger character set.

Martin Minow
minow@thundr.enet.dec.com
The above does not represent the position of Digital Equipment Corporation.

prc@erbe.se (Robert Claeson) (06/13/90)

In article <71@lysator.liu.se>, aronsson@lysator.liu.se (Lars Aronsson) writes:

> Now, if we are lucky, all these letters are in ISO 8859-1 thru -9 (are
> there nine?). But it seems we are stuck with this switching between
> character sets that we know so well from the 7-bit era. Does the ISO
> 8859 have codes that tell the equipment to set the right version or
> will we still do this by hand? Or should I learn postscript?

The ISO 2022 standard deals with character set switching within a
character stream.
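
To see what in-stream switching looks like, one can use the `iso2022_jp` codec shipped with Python, which implements a 7-bit profile built on ISO 2022. This is a sketch with a Japanese example only; it is not a demonstration of switching between ISO 8859 parts, and the sample text is arbitrary.

```python
ESC = b"\x1b"

text = "abc \u3042\u3044 xyz"       # ASCII, two hiragana, ASCII again
stream = text.encode("iso2022_jp")

# The byte string contains escape sequences that announce each change of
# character set; everything else is plain 7-bit data.
print(stream)
print("escape sequences:", stream.count(ESC))
print("all bytes 7-bit:", all(b < 0x80 for b in stream))

# Decoding from the start of the stream recovers the text.
round_trip = stream.decode("iso2022_jp")
print(round_trip == text)
```

The escape sequences travel inside the character stream itself, so the receiver has to read them in order to know how to interpret the bytes that follow.
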


-- 
Robert Claeson                  |Reasonable mailers: rclaeson@erbe.se
ERBE DATA AB                    |      Dumb mailers: rclaeson%erbe.se@sunet.se
                                |  Perverse mailers: rclaeson%erbe.se@encore.com
These opinions reflect my personal views and not those of my employer (ask him).

aronsson@lysator.liu.se (Lars Aronsson) (06/28/90)

prc@erbe.se (Robert Claeson) writes:

>As far as I know, ISO 2022 is mostly intended to be used for information
>*transfer*, not *processing*. Your word processor should convert the
>data read from the file into whatever internal format is appropriate
>for the operations you want to perform on it.

You are absolutely right. Transfer is the INTENT. Every new standard
has a quite limited scope intended, and as long as you stick to it
everything is fine.

RS-232 was intended to connect modems to terminals or modems to hosts.
But when you start connecting terminals to hosts, you get a lot of
trouble with the asymmetry of that standard (null modems etc).

Now, ISO 8859 is intended for sequential transfer of readable text.
Unfortunately, Sun Microsystems (and probably others as well) will soon
(has already?) start to ship operating systems (SunOS 4.1), where ISO
8859 is used for internal storage, e.g. in text files. Under UNIX,
most common text editors (like emacs) do not scan the entire file to
be edited, but use random access.

Is it the manufacturers who are stupid as they use standards outside
their intended scope or is it standard authors who are stupid as they
write standards with too narrow scope? I think standards should be
versatile. I think EIA (CCITT) should have written a general,
symmetric standard for serial links rather than the RS-232 (CCITT
V.24). I think that ISO should have written a general standard for
storing and random order retrieval of text (that is, no escape
sequences) rather than the six versions of ISO 8859. Perhaps, I am
stupid.

Lars Aronsson
     Aronsson@Lysator.LiU.SE

minow@mountn.dec.com (Martin Minow) (06/28/90)

In article <106@lysator.liu.se> aronsson@lysator.liu.se (Lars Aronsson) writes:
>Is it the manufacturers who are stupid as they use standards outside
>their intended scope or is it standard authors who are stupid as they
>write standards with too narrow scope? ... I think that ISO should have
>written a general standard for storing and random order retrieval of text
>(that is, no escape sequences) rather than the six versions of ISO 8859.
>Perhaps, I am stupid.

No, I doubt that Lars is stupid; nor are the manufacturers or standard
writers.  The vast majority of text will be written in a single language
family, and, hence, in a single variant of ISO-8859 (which has more than
six versions, but that is irrelevant).  A text editor, to use Lars' example,
generally would not see escape sequences in the embedded text (they would be
absorbed by the file/terminal read process), but would see a single 256 byte
dataspace (of which 94+95 are graphics).  Editors that do need to deal
with multiple ISO-8859 instances (say, an editor that must handle both
Swedish (ISO 8859-1) and Hebrew (ISO 8859-8)) must establish its own
internal mechanism for this problem.  (In the one I wrote, each character
was represented by a 16-bit quantity that encoded both character and
character set designator.)
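
A 16-bit cell of that general kind might look like the sketch below. This is not the editor described above, whose layout is not given; the split into a high "designator" byte and a low "character" byte, the designator numbers, and all function names are assumptions made for illustration. It uses Python's `iso8859_1` and `iso8859_8` (Latin/Hebrew) codecs.

```python
# Hypothetical designators: the high byte names the ISO 8859 part in use.
CHARSETS = {1: "iso8859_1", 8: "iso8859_8"}


def pack(charset_id: int, byte: int) -> int:
    """Combine a designator and an 8-bit character code into one 16-bit cell."""
    return (charset_id << 8) | byte


def unpack(cell: int) -> str:
    """Recover the character from a cell without looking at its neighbours."""
    charset_id, byte = cell >> 8, cell & 0xFF
    return bytes([byte]).decode(CHARSETS[charset_id])


def encode_run(text: str, charset_id: int) -> list:
    """Encode a run of text that lies entirely within one character set."""
    return [pack(charset_id, b) for b in text.encode(CHARSETS[charset_id])]


swedish = "\u00e5\u00e4\u00f6 "                 # three national letters, a space
hebrew = "\u05e9\u05dc\u05d5\u05dd"            # four Hebrew letters
buffer = encode_run(swedish, 1) + encode_run(hebrew, 8)

# Any cell can be read on its own, in any order: no escape sequences needed.
print(unpack(buffer[5]) == hebrew[1])
print("".join(unpack(cell) for cell in buffer) == swedish + hebrew)
```

The price is two bytes per character in memory; the gain is that every cell is self-describing, so the editor can jump around the buffer freely.
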

Manufacturers chose ISO 8859 (-1) in order to gain a consistent character
set representation among all applications and datafiles within a computer
system.  This seems to me to be a perfectly reasonable decision: having
multiple representations within a single system is, I can state from
experience, a mess.

The standard writers cannot, on the other hand, look outside the scope
of their "data transmission" standard.  For example, I might choose
to represent data within a system in a Huffman or Lempel-Ziv encoding:
as long as the external users see the ISO character set, I can claim
conformance to the standard without embarrassment.

There are, by the way, standards for storing and retrieval of text that
were developed by, among others, libraries and cancer registries.  Spending
a few hours in a good reference library will give you more standards
than you might wish.

Martin Minow
minow@bolt.enet.dec.com
The above does not represent the position of Digital Equipment Corporation

prc@erbe.se (Robert Claeson) (06/29/90)

In article <106@lysator.liu.se>, aronsson@lysator.liu.se (Lars Aronsson) writes:

> Now, ISO 8859 is intended for sequential transfer of readable text.

And so is ASCII (really).

> Unfortunately, Sun Microsystems (and probably others as well) will soon
> (has already?) start to ship operating systems (SunOS 4.1), where ISO
> 8859 is used for internal storage, e.g. in text files. Under UNIX,
> most common text editors (like emacs) do not scan the entire file to
> be edited, but use random access.

I believe you've got it somewhat wrong here. ISO 8859 specifies a set of
character sets, each divided into two 7-bit pages (of which the left page,
i.e. character codes 0-127, is the same thing as ASCII) for a total of 8 bits.

ISO 8859 specifies 65 control characters (the first 32 in each page plus
the last character in the left page). It does not, however, specify a
means for switching among various character sets. That is the intent of
ISO 2022. A data stream that is encoded using ISO 8859 can be accessed
in any random order. Each character has its distinct meaning. There are
no multi-byte sequences specified in ISO 8859, although they can be
constructed using sequences of control- and graphic characters (which is
what ISO 2022 uses).

However, any data stream that is encoded using a character set (including
the infamous 7-bit set ISO 646) and that uses ISO 2022 for switching
among various character sets, can only be accessed in sequential order.
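
The difference between the two cases can be shown with a short sketch. It assumes Python's `iso8859_1` codec for the 8-bit case and its `iso2022_jp` codec as a convenient example of a stream that uses ISO 2022 switching; the sample strings are arbitrary.

```python
# Case 1: ISO 8859. One byte is one character, so byte k is character k.
latin = "na\u00efve caf\u00e9".encode("iso8859_1")
third = latin[2:3].decode("iso8859_1")
print(third == "\u00ef")                      # True

# Case 2: a stream that switches character sets with escape sequences.
stream = "abc\u3042\u3044".encode("iso2022_jp")
start = stream.index(b"\x1b") + 3             # skip the 3-byte ESC $ B
fragment = stream[start:start + 2]            # two bytes of the first hiragana

# Read "cold", the decoder is in its initial ASCII state and misreads them.
seen_cold = fragment.decode("iso2022_jp")
# Read with the state set up by the preceding escape, they are one character.
seen_in_context = (b"\x1b$B" + fragment + b"\x1b(B").decode("iso2022_jp")

print(seen_cold == "\u3042")                  # False
print(seen_in_context == "\u3042")            # True
```

The bytes are identical in both readings; only the state established by the earlier escape sequence tells the reader what they mean, which is why such a stream has to be scanned from the beginning.
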

-- 
Robert Claeson                  |Reasonable mailers: rclaeson@erbe.se
ERBE DATA AB                    |      Dumb mailers: rclaeson%erbe.se@sunet.se
                                |  Perverse mailers: rclaeson%erbe.se@encore.com
These opinions reflect my personal views and not those of my employer (ask him).

kkim@plains.UUCP (kyongsok kim) (06/29/90)

In article <106@lysator.liu.se> aronsson@lysator.liu.se (Lars Aronsson) writes:
:prc@erbe.se (Robert Claeson) writes:
:
:>As far as I know, ISO 2022 is mostly intended to be used for information
:>*transfer*, not *processing*. Your word processor should convert the
:>data read from the file into whatever internal format is appropriate
:>for the operations you want to perform on it.
:
:You are absolutely right. Transfer is the INTENT. Every new standard
:has a quite limited scope intended, and as long as you stick to it
:everything is fine.

I had the same question and read ISO documents very carefully several
times only to fail to find any clue.
I wonder if such an intent has been explicitly mentioned in the ISO
document?  Or is it just a guess?

Thanks.