[comp.sys.amiga.advocacy] Announcement--new "Unicode" standard

torrie@cs.stanford.edu (Evan Torrie) (02/25/91)

es1@cunixb.cc.columbia.edu (Ethan Solomita) writes:

>In article <1902@public.BTR.COM> thad@public.BTR.COM (Thaddeus P. Floryan) writes:
>>In article <39545@cup.portal.com> Classic_-_Concepts@cup.portal.com writes:
>>>
>>>                    Unicode Standards -- News Bulletin
>>>                    ==================================
>>> [...]
>>
>>and neglected to mention the proposed 16-bits per character in the article
>>about Unicode that I read.
>>
>	What? You mean all those programs that depend on
>sizeof(char) being 1 are going to break! 8-)

>	Actually, there is no reason why a new 8-bit ASCII
>definition cannot cover most all of the accentuations based upon
>the English/French/German/... alphabet. I'm sure the people
>making those standards aren't truly dumb, but they will get no
>support for a 16 bit standard.

  An 8-bit ASCII can cover most of the English/French/German alphabet,
but that is not what the vendors behind Unicode are aiming for.  You
see, there's this thing called "market size"...  There happen to be
more people who speak Chinese than there are people who speak any
other language.
  That's a huge market (1.1 billion people and growing faster than any
other market), which few vendors are able to tap with their 8-bit
Roman-alphabet based systems.
  I'm glad ASCII is finally being phased out... it's just not
appropriate for today's international world.

-- 
------------------------------------------------------------------------------
Evan Torrie.  Stanford University, Class of 199?       torrie@cs.stanford.edu   
Today's maxim:  All socialists are failed capitalists

peter@sugar.hackercorp.com (Peter da Silva) (02/25/91)

The problem with the Unicode standard isn't that it's new, or will
require massive changes to existing software. The problem is that it's
fighting ISO 10646, a 32-bit standard with 16- and 8-bit subsets.

Just what we need, *two* standards for wide characters. Anyone remember
the ASCII versus EBCDIC wars?
-- 
Peter da Silva.   `-_-'
<peter@sugar.hackercorp.com>.

davidm@uunet.UU.NET (David S. Masterson) (02/25/91)

>>>>> On 24 Feb 91 18:01:48 GMT, cs326ag@ux1.cso.uiuc.edu (Loren J. Rittle) said:

Loren> In article <1991Feb24.164137.11897@cunixf.cc.columbia.edu>
Loren> es1@cunixb.cc.columbia.edu (Ethan Solomita) writes:

Ethan> In article <1902@public.BTR.COM> thad@public.BTR.COM (Thaddeus P.
Ethan> Floryan) writes:

Thad> and neglected to mention the proposed 16-bits per character in the
Thad> article about Unicode that I read.

Ethan> What? You mean all those programs that depend on sizeof(char) being 1
Ethan> are going to break! 8-)

Loren> Yes, I see the 8-), but I want to make sure everyone knows that in C,
Loren> sizeof(char) BY definition is 1. No if's but's or or's.

Ah, but there is a "but".  In <limits.h>, there is a macro called CHAR_BIT which
gives the number of bits in the smallest possible data object (a char).  It
must be *at least* 8.  So, sizeof(char) is 1, but the number of bits in
that "1" can be greater than eight.

Ethan> Actually, there is no reason why a new 8-bit ASCII definition cannot
Ethan> cover most all of the accentuations based upon the
Ethan> English/French/German/...  alphabet. I'm sure the people making those
Ethan> standards aren't truly dumb, but they will get no support for a 16 bit
Ethan> standard.

Loren> That is kind of amusing, as we already have an 8-bit ASCII ISO standard
Loren> that covers English/French/German/... alphabets.  Ethan, the Amiga uses
Loren> this!  I think all these companies are just blowing smoke, I hope
Loren> nothing comes of it!

Yeah, but as someone else pointed out, there is a whole lot more to the world
than those people who write in the Roman alphabet.  The Japanese and Russian
markets alone suggest a huge market that is relatively untapped thus far.
Believing that these markets must write English in order to make use of our
computers is relatively self-centered, don't you think?  And waiting until the
market is well established before doing the development necessary would
probably mean that American computer companies just won't be players in that
market. 
--
====================================================================
David Masterson					Consilium, Inc.
(415) 691-6311					640 Clyde Ct.
uunet!cimshop!davidm				Mtn. View, CA  94043
====================================================================
"If someone thinks they know what I said, then I didn't say it!"

daglem@idt.unit.no (Dag Lem) (02/25/91)

In article <1991Feb24.220323.27961@cunixf.cc.columbia.edu>, es1@cunixb.cc.columbia.edu (Ethan Solomita) writes:
|> In article <1991Feb24.180148.21954@ux1.cso.uiuc.edu> cs326ag@ux1.cso.uiuc.edu (Loren J. Rittle) writes:
|> >In article <1991Feb24.164137.11897@cunixf.cc.columbia.edu> es1@cunixb.cc.columbia.edu (Ethan Solomita) writes:
|> 
|> >>	Actually, there is no reason why a new 8-bit ASCII
|> >>definition cannot cover most all of the accentuations based upon
|> >>the English/French/German/... alphabet. I'm sure the people
|> >>making those standards aren't truly dumb, but they will get no
|> >>support for a 16 bit standard.
|> >
|> >That is kind of amusing, as we already have an 8-bit ASCII
|> >ISO standard that covers English/French/German/... alphabets.
|> >Ethan, the Amiga uses this!  I think all these companies are
|> >just blowing smoke, I hope nothing comes of it!
|> >
|> 	By the German/..., the ... refers to things like the
|> Scandinavian languages, which add a tremendous number of
|> characters to the alphabet. There are also some unusual
|> characters in Eastern European languages.
|> 
Get real! The Scandinavian languages do not add a *tremendous* number of
characters to the alphabet! And we're all quite happy with the Amiga character
set; all our characters are there! Norwegian adds ae, oe, aa, Danish adds
nothing above that, Swedish adds another version of ae... Tremendous number?
Nah!

|> >Loren J. Rittle
|> >-- 
|> >``NewTek stated that the Toaster *would not* be made to directly support the
|> >  Mac, at this point Sculley stormed out of the booth...'' -A scene at the
|> >  recent MacExpo.  Gee, you wouldn't think that an Apple Exec would be so
|> >  worried about one little Amiga Device... Loren J. Rittle  l-rittle@uiuc.edu
|> 
|> 
|> 	-- Ethan
|> 
|> 
|> Q:	What's the definition of a Quayle?
|> 
|> A:	Two right wings and no backbone.

-- 
    __                /                       |> Shorter of breath
   /  \  __   __     /   ___  _ _             |> and one day closer to death.
  /    |/  / /  /   /   /__/ / / /            |>   -Roger Waters, Time 
 /____/ \_/|/\_/   /____\___/ /  \__________  |>    (Dark Side of The Moon) 
______________/

buzzard@eng.umd.edu (Sean Barrett) (02/25/91)

I think we have gotten far off from the Amiga here. *sigh*

So xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) says:
>
> Classic_-_Concepts@cup.portal.com writes a description of Unicode
> thad@public.BTR.COM (Thaddeus P. Floryan) points out it takes 16 bits
>
>Actually, 16 bits is insufficient; a comprehensive Chinese dictionary of
>ideographs contains over 100,000 individual ones, and that is just one
>language's alphabet.

If the goal is portably *transmitting* characters, you can use 16 or 32
bits freely, making sure that the bit patterns you use don't have any
"naughty" 8/7-bit ASCII values in them.  If you don't need to be able
to intermix or auto-detect what language something is, this is sufficient.

> Ethan (deleted the ref, oops) is sure the people making the standards
>    aren't dumb, but that they'll get no support for 16 bit.
>
>Actually, the ASCII standard has extension mechanisms for ever larger
>byte sizes; the use of 7 bit ASCII is tremendously parochial of the US.

A trivial extension to ASCII (isn't this *obvious*?!!) if you need 32768
new characters is to throw out parity, leave ASCII 0-127 alone, and
on ASCII 128-255 fetch the next character and combine them, giving
15 bits of new characters (or you could fetch 3 and have 31 bits, or
you could have 128-191 fetch one character for 14 bits, 192-255 fetch
3 characters more for 30 bits).
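
A minimal sketch of the simplest variant above (a byte in 128-255 is combined
with the next byte to give 15 bits), written in Python purely for illustration.
The function names and the ("ascii", n) / ("ext", n) item format are
hypothetical, and the bit layout (low 7 bits of the lead byte on top, the
following byte below) is one assumed way of "combining" the two bytes.

```python
# Two-byte extension sketch: bytes 0-127 stand for themselves; a byte in
# 128-255 is a lead byte whose low 7 bits are combined with the following
# byte to form a 15-bit extended index (0..32767). Layout is an assumption.

def encode(items):
    """Encode a list of ("ascii", n) or ("ext", n) items into bytes."""
    out = bytearray()
    for kind, n in items:
        if kind == "ascii":
            if not 0 <= n < 128:
                raise ValueError("ASCII code out of range")
            out.append(n)
        else:
            if not 0 <= n < 32768:
                raise ValueError("extended index needs more than 15 bits")
            out.append(0x80 | (n >> 8))  # lead byte: high bit set, top 7 bits
            out.append(n & 0xFF)         # trailing byte: low 8 bits
    return bytes(out)


def decode(data):
    """Inverse of encode(); raises ValueError on a truncated pair."""
    items = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 128:
            items.append(("ascii", b))
            i += 1
        else:
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte character")
            items.append(("ext", ((b & 0x7F) << 8) | data[i + 1]))
            i += 2
    return items
```

Note that in this layout the trailing byte can take any value, including
values that look like plain ASCII, so a program that scans the stream one
byte at a time without tracking lead bytes could misread it.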

peterk@cbmger.UUCP (Peter Kittel GERMANY) (02/26/91)

In article <1991Feb25.090953.18672@Neon.Stanford.EDU> torrie@cs.stanford.edu (Evan Torrie) writes:
>thad@public.BTR.COM (Thaddeus P. Floryan) writes:
>
>>Think back to why the metrification of the USA failed ... most of the WORLD's
>>production is in the USA based on OUR standard (crappy though the "English"
>>system of weights and measures may be).
>
>  Why did the metrification of the USA fail?  Since coming here 6
>months ago, it's one of the things that has really annoyed me... Why
>on earth would you cling to a system where below freezing means less
>than 32 degrees?  What possible reason is there for not changing to a
>Celsius temperature system?
>  I, for one, would like to see a push towards metrification in the
>US.

The version I heard about metrification in the USA said that they chose
a different set of metric measures than Europe (or the rest of the world).
So these so-called metric parts manufactured in the USA still were not
compatible with those made in Europe! I find it silly that they
didn't jump on the wagon but had to do it their own (new, non-standard)
way, pretending it was a standard when indeed it was not.

-- 
Best regards, Dr. Peter Kittel  // E-Mail to  \\  Only my personal opinions... 
Commodore Frankfurt, Germany  \X/ {uunet|pyramid|rutgers}!cbmvax!cbmger!peterk

eachus@aries.mitre.org (Robert I. Eachus) (02/26/91)

     As someone who has to deal with international standards, and has
had to look at the existing and proposed ISO standards, let me add a
few facts to this discussion...

     Seven-bit ASCII (yes, 7) is the American version of the ISO-646
character set.  There are ten characters in the ISO-646 set known as
"national use" characters which can be defined differently by
different national standards organizations.  (These include the
{ASCII} characters []$ etc.)  Over thirty of these national sets have
been defined.

     ISO 2022 is a standard for combining two active 7-bit (actually 95
character) sets in a standard 8-bit format, with control characters to
switch active sets.  The control character sets are currently defined
in ISO 6429.  ISO 2022 also allows two-byte character sets such as
Japanese to be embedded in a one-byte stream.

     The various one-byte character sets to use with ISO 2022 are
defined in ISO 8859.  All have the ANSI assignments in the lower half,
and combinations of national character sets in the upper page.  These
include:
 
          Part 1   Latin-1   Western Europe (except Iceland)
          Part 2   Latin-2   Eastern Europe
          Part 3   Latin-3   Southern Europe
          Part 4   Latin-4   Northern Europe
          Part 5   Latin/Cyrillic
          Part 6   Latin/Arabic
          Part 7   Latin/Greek
          Part 8   Latin/Hebrew
          Part 9   Latin-5   Western Europe (variation)

     The most commonly used of these is Latin-1 which corresponds to
the labels in FED (if you set high to 255 :-).
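
As a rough illustration of why the part number matters, the sketch below
decodes the same two bytes under three of the parts listed above.  It is
written in Python and relies on that language's standard codec names
(iso8859_1, iso8859_5, iso8859_7); the sample bytes and the helper name
decode_all are chosen only for this example.

```python
# The same byte string read under three different ISO 8859 parts.
# 0x41 sits in the lower half, which is shared; 0xE4 sits in the upper
# half, which differs from part to part.
RAW = bytes([0x41, 0xE4])

CODECS = {
    "Part 1 (Latin-1)": "iso8859_1",
    "Part 5 (Latin/Cyrillic)": "iso8859_5",
    "Part 7 (Latin/Greek)": "iso8859_7",
}


def decode_all(raw):
    """Return a dict mapping each part label to the decoded text."""
    return {label: raw.decode(codec) for label, codec in CODECS.items()}


if __name__ == "__main__":
    for label, text in decode_all(RAW).items():
        print(label, repr(text))
```

The lower-half byte comes out as "A" every time, while the upper-half byte
becomes a different letter in each part, which is why a stream has to say
(or switch, as with ISO 2022) which set is active.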

     Now we get to the big stuff.  There is currently a draft ISO
standard 10646, which includes every known character from every
language in the world (with LOTS of room for expansion).  It is known
as MOCS (for multi-octet character set) and each unique character (or
variant such as capitalized, bold, or underlined) has a unique 32-bit
designation.  ISO 10646 can be represented as streams of one, two,
three, or four octets using control characters, and there is a
two-octet subset which contains all of the characters from the ISO
8859 sets, a lot of other small character sets, and a (9025 character)
page each of Japanese, Chinese, and Korean.  This 16-bit set, with
escapes, looks likely to become the standard multi-lingual interchange
format.

     So the Amiga world is "up" to the current 8859 standards, as well
as supporting ISO 646 style national sets, and could theoretically
support full 10646 (if you had enough memory :-).  The Amiga doesn't
currently support 2022 style character set mixing as far as I know
(that's all right, almost no one else does either).  Individual
programs such as Notepad and some DTP programs support mixed fonts,
but I don't think any of them conform to 2022.
--

					Robert I. Eachus

     Our troops will have the best possible support in the entire
world.  And they will not be asked to fight with one hand tied behind
their back.  President George Bush, January 16, 1991

ltf@ncmicro.lonestar.org (Lance Franklin) (02/26/91)

In article <1991Feb25.091818.19027@Neon.Stanford.EDU> torrie@cs.stanford.edu (Evan Torrie) writes:
}es1@cunixb.cc.columbia.edu (Ethan Solomita) writes:
}>What may be logical is
}>reserving certain 8-bit ASCII codes to mean that the next byte is
}>a letter in language FOO. That would probably cause the least
}>inconvenience.
}
}  I don't know whether this would be workable... There are something
}like 10000? distinct ideographs in the Asian languages, so you need
}around 13-14 bits to store them all.  This wouldn't leave you enough
}space to store the rest of normal ASCII. 
}  I imagine there would be some sort of compromise to allow old
}programs to run unchanged under a 16-bit code.

This should not be too difficult...by merely reserving 64 ascii codes
out of 256 as extended code, signifying that the next byte is a
character in one of 64 extended sets, you allow for 16384 extended
characters.  In a Japanese system that I worked with once, the standard
character set contained both standard English symbols and single
width Japanese characters, with the extended set all being double
width.  This had the added advantage that the string length of the
output string was also the number of character positions that the
output took up on the screen.  Of course, for ease of use, you'd
probably want to disallow the use of a zero-value byte for the second
byte, so you'd actually be talking about 64*255 extended characters,
or 16320 characters.
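
The arithmetic above can be checked with a short sketch, here in Python.
The post does not say which 64 codes would be reserved, so the range
0xC0-0xFF used below is purely an assumption, as are the function names.

```python
# Sketch of the "64 reserved lead codes" scheme. Which 64 codes are
# reserved is NOT given in the post; 0xC0-0xFF is an assumption.
LEAD_BASE = 0xC0
NUM_SETS = 64
SECOND_VALUES = 255  # second byte runs 1..255; zero is disallowed

CAPACITY = NUM_SETS * SECOND_VALUES  # number of extended characters


def encode_extended(index):
    """Map an extended character index to (lead byte, non-zero second byte)."""
    if not 0 <= index < CAPACITY:
        raise ValueError("index outside the extended range")
    set_no, offset = divmod(index, SECOND_VALUES)
    return bytes([LEAD_BASE + set_no, offset + 1])


def decode_extended(pair):
    """Inverse of encode_extended()."""
    lead, second = pair
    if not LEAD_BASE <= lead <= 0xFF or second == 0:
        raise ValueError("not a valid extended pair")
    return (lead - LEAD_BASE) * SECOND_VALUES + (second - 1)
```

Because no encoded pair ever contains a zero byte, C-style string routines
that stop at the first NUL would still see the whole string, which is
presumably the "ease of use" the restriction buys.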

By the way, on this system, programs did run unchanged on either 8 or
16 bit systems.  The display hardware actually did the hard work.  On
a non-Japanese system, the strings were output as mostly junk
characters, with occasional English numbers or words interspersed.
In text modes, the display hardware did the correct mapping to standard
or extended character set, and I imagine that the display BIOS did
something similar when drawing text in a graphics mode.

Now, the big problem in this deal is not displaying the codes...that's
the easy part.  The hard part is figuring out stuff like string
functions in C, sorting routines...and how to handle languages like
Hebrew, where the text is displayed right-to-left (If memory serves).
In addition, if I'm not mistaken, a character's look may change based
on the characters that precede and/or follow it.

All in all, a nasty little problem.

Lance
-- 
Lance T. Franklin            +----------------------------------------------+
(ltf@ncmicro.lonestar.org)   | "You want I should bop you with this here    |
NC Microproducts, Inc.       |    Lollipop?!?"                 The Fat Fury |
Richardson, Texas            +----------------------------------------------+

thad@public.BTR.COM (Thaddeus P. Floryan) (02/26/91)

In article <1991Feb26.004323.17914@zorch.SF-Bay.ORG> xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) writes:
> thad@public.BTR.COM (Thaddeus P. Floryan) writes:
>Stupid jingoism helps no one.
>[...]
>
>> Sheesh. :-)
>
>Indeed.

Thanks for leaving at least one of my ":-)" in your reply!  :-)

There have been some most-interesting replies in this "thread" already,
but I haven't seen any mention of "trigraphs".  Am I going blind?

Seriously, this topic does NOT belong in *.advocacy and had it been in a
more-appropriate forum my original reply (retort) wouldn't have been so
sarcastic.  Sadly, Julie (who started this thread and is still at PORTAL)
isn't able to see ANY of this because PORTAL holds incoming Amiga stuff in
suspense for 7 days, then "loads" it, then immediately deletes it due to the
7-day expiration; that stupid policy change resulted from the comp.sys.amiga.*
re-org with PORTAL's contention the changed hierarchy is "new" with no tenure
and thus lower than whale turds; why'd you think I switched to BTR as a primary
with some other systems as "backup"?  :-)

I don't see any distribution problems for Mac, Apple II, or IBM newsgroups
at PORTAL (which, by the way, is situated in Computertino CA, Apple's stomping
grounds).

THAT is an "advocacy" issue: PORTAL's handling of the Amiga newsgroups, a
new "cause celebre" (HEY, the irony: I *need* the Unicode to properly accent
that phrase! :-)

Thad Floryan [ thad@btr.com (OR) {decwrl, mips, fernwood}!btr!thad ]

eachus@aries.mitre.org (Robert I. Eachus) (02/27/91)

In article <37010002@hpfcdc.HP.COM> koren@hpfcdc.HP.COM (Steve Koren) writes:

   (A lot of good information from the Unix viewpoint skipped, as you
may have noticed, I am involved mostly with the ISO standards.)

  > In the Unix world, there has been support for 16 bit character sets
  > for many years now...

   Except in filesystem names!  A number of vendors are trying to also
support 16-bit character set filesystems (with tar files etc., totally
unportable back to 8-bit filesystems).  But a few years ago, I only
knew of one company (Stratus) trying for full ISO 2022 support (mixed
1, 2, and 3 byte characters) in the names of files.  Although the
file names were still limited to, I think, 30 bytes in canonical
form, the switchover really slowed down a number of filesystem
primitives.

   Do you know of anyone else who has done a good (and compatible)
wide character set filename system?  An interesting thought is that it
might not be too difficult to build an Amiga file system (and matching
shell) which provided support for say JIS X0208 or some other 16 bit
set...

--

					Robert I. Eachus

     Our troops will have the best possible support in the entire
world.  And they will not be asked to fight with one hand tied behind
their back.  President George Bush, January 16, 1991

david@twg.com (David S. Herron) (02/28/91)

In article <1906@public.BTR.COM> thad@public.BTR.COM (Thaddeus P. Floryan) writes:
>In article <1991Feb24.175313.10206@Neon.Stanford.EDU> torrie@cs.stanford.edu (Evan Torrie) writes:
>> [..crap deleted...]
>>  I'm glad ASCII is finally being phased out... it's just not
>>appropriate for today's international world.
>
>Get real, guy.  MOST of the computer languages are based on the American
>language (in some shape, way or form, cryptic though some may be :-)

So?

>Next thing we know, you're gonna be proposing an Urdu or Sanskrit version
>of C or some other language.

There's support (limited) (in the ANSI C standard) for using odd
character sets to write C.

>Think back to why the metrification of the USA failed ... most of the WORLD's
>production is in the USA based on OUR standard (crappy though the "English"
>system of weights and measures may be).

Look.. this "We're American and superior to everybody else" attitude
has to stop.  Just 'cause we speak an odd form of English, and we all
live in luxury (compared to most of the world that is) doesn't make us or
our way of life any better than anybody else.

In the case of computers & the character sets used to communicate
with computers ... What kind of gall do we have to say that the
limited character set in ASCII is what *EVERYONE* is supposed to
use?!? The job of computers is *NOT* to force us to do things the
way the computer wants it done, but is to help *US* with our own
job.  By telling the Germans & Scandinavians they can't put umlauts
and that "TH" character into their text we're in effect saying that
they have to bow and scrape to every whim of the computer.

Put yourself in the place of one of those people.  They've been
putting all them weird accent marks on their characters all their
life.  Now they start using computers & can't put 'em there.  In
French there are a few words for which the accent marks are what
distinguishes the two words (The two forms of "la" for instance),
and I am sure that this is true for other languages.  Telling them
that they cannot use those accent marks is a Rude, Crude and Socially
Obscene form of societal domination.

See my ".signature" for a bowing & scraping I am making in regards
to trying to reproduce a French word in ASCII.  I would *DEARLY*
love to have that look right ...

>Sheesh.  :-)

At least you put a :-) in there ... ;-)

>Thad Floryan [ thad@btr.com (OR) {decwrl, mips, fernwood}!btr!thad ]

(BTW: There is "standards work" going on to extend electronic mail
 to be able to contain more than "just" ASCII text.  There is already
 X.400 which contains all that stuff, but RFC-822 is in the process
 of being extended to be able to contain multiple body parts & other
 character sets than ASCII.)

	David

-- 
<- David Herron, an MMDF & WIN/MHS guy, <david@twg.com>
<- Formerly: David Herron -- NonResident E-Mail Hack <david@ms.uky.edu>
<-
<-	MS-DOS ... The ultimate computer virus.

s37732v@puukko.hut.fi (Markus Aalto) (02/28/91)

   >>	Actually, there is no reason why a new 8-bit ASCII
   >>definition cannot cover most all of the accentuations based upon
   >>the English/French/German/... alphabet. I'm sure the people
   >>making those standards aren't truly dumb, but they will get no
   >>support for a 16 bit standard.
   >
   >That is kind of amusing, as we already have an 8-bit ASCII
   >ISO standard that covers English/French/German/... alphabets.
   >Ethan, the Amiga uses this!  I think all these companies are
   >just blowing smoke, I hope nothing comes of it!
   >
	   By the German/..., the ... refers to things like the
   Scandinavian languages, which add a tremendous number of
   characters to the alphabet. There are also some unusual
   characters in Eastern European languages.

Well the Finnish and Swedish characters fit very nicely into 8-bit
ASCII!
We are very lucky to have such an operating system as the Amiga has.
Switching between different keymaps is just so easy!

--


***********************************************************************
*  Markus Aalto           |       Only Amiga makes it possible!!!!    *
*  s37732v@puukko.hut.fi  |       Yeah! It's a sure thing!            *
*  maalto4@otax.hut.fi    |       :^)                                 *
***********************************************************************

goodwinm@prism.cs.orst.edu (GOODWIN MICHAEL LEE) (03/01/91)

In article <S37732V.91Feb28110704@puukko.hut.fi> s37732v@puukko.hut.fi (Markus Aalto) writes:
>
>   >>	Actually, there is no reason why a new 8-bit ASCII
>   >>definition cannot cover most all of the accentuations based upon
>   >>the English/French/German/... alphabet. I'm sure the people
>   >>making those standards aren't truly dumb, but they will get no
>   >>support for a 16 bit standard.
>   >
>   >That is kind of amusing, as we already have an 8-bit ASCII
>   >ISO standard that covers English/French/German/... alphabets.
>   >Ethan, the Amiga uses this!  I think all these companies are
>   >just blowing smoke, I hope nothing comes of it!
>   >
>	   By the German/..., the ... refers to things like the
>   Scandinavian languages, which add a tremendous number of
>   characters to the alphabet. There are also some unusual
>   characters in Eastern European languages.
>
>Well the Finnish and Swedish characters fit very nicely into 8-bit
>ASCII!
>We are very lucky to have such an operating system as the Amiga has.
>Switching between different keymaps is just so easy!
>

I agree with you.  Now if only someone would standardize 8-bit ASCII like 7-bit
ASCII is, then they would not need a 16-bit character set, save maybe Asian 
characters...

bernie@metapro.DIALix.oz.au (Bernd Felsche) (03/02/91)

I suppose, to put this all into perspective, I should paraphrase what
I read recently in UNIX Review:

	Standards are for boring things!

Don't recall who wrote it, but he's right.
-- 
Bernd Felsche,                 _--_|\   #include <std/disclaimer.h>
Metapro Systems,              / sale \  Fax:   +61 9 472 3337
328 Albany Highway,           \_.--._/  Phone: +61 9 362 9355
Victoria Park,  Western Australia   v   Email: bernie@metapro.DIALix.oz.au

dfrancis@tronsbox.xei.com (Dennis Heffernan) (03/02/91)

	I'm just trying to envision the SHIFT KEY COMBINATIONS we're going
to need to TYPE all these wonderful characters...

	"Double bucky, you're the one..."

	:-)


dfrancis@tronsbox.xei.com    Dennis Francis Heffernan	GEnie: D.HEFFERNAN1
------------------------------------------------------------------------------
"Look...I sympathize with your problem, but all things considered, I'd rather 
talk about me."  --Murphy Brown

sschaem@starnet.uucp (Stephan Schaem) (03/05/91)

Control sequences are available! Why not have both?
There is not much to say. Just make the 65536 possibilities available,
and don't use them if you don't need them.
The CSI escape is available, so why not use it!

sschaem@starnet.uucp (Stephan Schaem) (03/05/91)

 Well, if you have more than 100,000 signs with 32-bit coding and a
 corresponding bitmap for each at a resolution of 32x32 (some signs are
 complex, and at that point why not use 32 instead of 24 :-),
 then you need about 12 meg of 'fonts'... of course they would be
 compacted :-)
 A 1024x1024x2 monitor will give you a 32x32 matrix of characters, but
 only 20x12 on a 'normal' PC high-resolution display.
 I really don't think you can make something affordable for
 that 1 billion market share.
 That kind of thing is not going to be accepted as a standard, and is
 currently used for publishing, in which case they can do whatever they want!
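
 The figures above can be checked with a small sketch, here in Python.
 It assumes one bit per pixel and uncompressed glyphs; the function names
 are hypothetical, and 640x400 is only an assumed size for the 'normal'
 PC high-resolution screen, since the post does not give one.

```python
# Back-of-the-envelope check of the font size and screen grid figures.
# Assumptions: 1 bit per pixel, no compression, fixed-size glyph cells.

def bitmap_font_bytes(glyphs, width, height, bits_per_pixel=1):
    """Total storage in bytes for an uncompressed bitmap font."""
    bits = glyphs * width * height * bits_per_pixel
    return (bits + 7) // 8


def character_grid(screen_w, screen_h, glyph_w, glyph_h):
    """How many whole glyph cells fit across and down the screen."""
    return screen_w // glyph_w, screen_h // glyph_h


if __name__ == "__main__":
    print(bitmap_font_bytes(100_000, 32, 32))   # bytes for 100,000 glyphs
    print(character_grid(1024, 1024, 32, 32))   # large monitor
    print(character_grid(640, 400, 32, 32))     # assumed PC screen size
```

 At 128 bytes per glyph, 100,000 glyphs come to 12,800,000 bytes, which is
 where the "12 meg" figure comes from.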