[net.internat] The real work of internationalization

gnu@l5.uucp (John Gilmore) (10/14/85)

The issue of the 8th bit is not the real problem.  It's clear that all
the programs that hack the 8th bit will have to be rewritten.

The ideal objective is for the same binaries to run anywhere in the
world, in any font or language or currency or date/time format.
[For now, let's not get off into currency/date/time conversions, and
just talk about character set representation issues.]

What will cause a LOT of grief is fitting the large Asian character
sets in.  I saw a memo purported to come from somewhere in AT&T that
seemed to be a mix of realism and brain damage.  Some of the brain
damage included:

     *	a "long char" data type for C -- haven't they ever heard of "short"?

     *	No "locking" select-character-set codes embedded in data streams
	(like what you'd send to a terminal to enter the "graphics character
	set").  Instead, they had two different ways to encode extended
	character sets (beyond 8-bit), and a bit OUTSIDE THE DATA STREAM
	(eg in the inode of a disk file) that said which format a file
	was in.  The two formats were for places where 8-bit or >8-bit
	character sets were the norm.

I don't think either of those is a viable idea, but I'm not sure that
a single representation will suffice UNLESS there are locking character set
selections (so the first few bytes of your file would describe its
default character sets, if strange).  Once you open that can, various
other worms come out, like making sure those specs get propagated
when you cut and paste in an editor, etc.

It's quite a job when you realize that unless ALL the Unix utilities
process Asian characters as characters, the system will lose.  Any
volunteers to hack grep for 16-bit characters encoded in an 8-bit
data stream with case shifts?
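
To see why a byte-at-a-time grep loses, here is a rough sketch. It assumes a
hypothetical encoding in which SO and SI act as locking shifts and every
16-bit character is two bytes, neither of which is SO or SI; none of this
comes from a real grep or from the memo, and the function names are invented
for the illustration.

```python
SO, SI = 0x0E, 0x0F  # assumed locking-shift codes for this sketch


def decode(data):
    """Turn a shifted byte stream into a list of character codes."""
    chars = []
    wide = False
    i = 0
    while i < len(data):
        b = data[i]
        if b == SO:
            wide = True
            i += 1
        elif b == SI:
            wide = False
            i += 1
        elif wide:
            if i + 1 >= len(data):
                raise ValueError("stream ends in the middle of a 16-bit character")
            chars.append((b << 8) | data[i + 1])
            i += 2
        else:
            chars.append(b)
            i += 1
    return chars


def contains(haystack, needle):
    """Search character by character rather than byte by byte."""
    h, n = decode(haystack), decode(needle)
    return any(h[i:i + len(n)] == n for i in range(len(h) - len(n) + 1))


# 'x', then the 16-bit characters 0x3041 and 0x4221, then 'y'.
line = b"x" + bytes([SO, 0x30, 0x41, 0x42, 0x21, SI]) + b"y"
print(b"AB" in line)          # True: the bytes 0x41 0x42 straddle two characters
print(contains(line, b"AB"))  # False: no such characters are in the line
```

The byte-wise test reports a match that no reader of the text would see,
which is the sense in which the utility has to process characters as
characters.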

Of course stdio would be modified to encode and decode the extended
character set, and that will do much of the work for us.  Maybe that
should be our first research project -- a public domain stdio that
defines a standard programming interface to 16-bit characters and a
standard datastream representation for them.
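
As a very rough sketch of what such an interface might look like: the caller
reads and writes character codes, and only the library knows the byte
representation. Everything here is an assumption made for illustration. The
names putwc and getwc are hypothetical, and the representation (one byte for
codes below 0x80, otherwise two bytes with the high bit set, which carries
only 14 bits) is one arbitrary choice that sidesteps the locking-shift
question rather than answering it.

```python
import io


class WideStream:
    """A stdio-like wrapper: callers see character codes, the file sees bytes."""

    def __init__(self, raw):
        self.raw = raw  # any binary file object

    def putwc(self, code):
        if code < 0x80:
            self.raw.write(bytes([code]))
        elif code < 0x4000:
            # Two bytes, high bit set on both, 7 payload bits each.
            self.raw.write(bytes([0x80 | (code >> 7), 0x80 | (code & 0x7F)]))
        else:
            raise ValueError("code does not fit this 14-bit sketch")

    def getwc(self):
        first = self.raw.read(1)
        if not first:
            return None  # end of file
        if first[0] < 0x80:
            return first[0]
        second = self.raw.read(1)
        if not second or second[0] < 0x80:
            raise ValueError("truncated or malformed two-byte character")
        return ((first[0] & 0x7F) << 7) | (second[0] & 0x7F)


buf = io.BytesIO()
out = WideStream(buf)
for code in (0x41, 0x3042, 0x0A):
    out.putwc(code)

buf.seek(0)
inp = WideStream(buf)
codes = [inp.getwc(), inp.getwc(), inp.getwc(), inp.getwc()]
print(codes)  # [65, 12354, 10, None]
```

A program written against putwc and getwc would not need to change if the
datastream representation were later replaced.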

tmb@talcott.UUCP (Thomas M. Breuel) (10/15/85)

In article <191@l5.uucp>, gnu@l5.uucp (John Gilmore) writes:
> What will cause a LOT of grief is fitting the large Asian character
> sets in.  I saw a memo purported to come from somewhere in AT&T that
[...]
> It's quite a job when you realize that unless ALL the Unix utilities
> process Asian characters as characters, the system will lose.  Any

What do you mean? Most UN*X utilities are programming utilities,
and nobody is going to program in Chinese characters. And the demands of
Chinese and Japanese word processing are so utterly different that
a completely new kind of user interface and a completely new set of
utilities is needed anyhow (sort, grep, &c don't really make sense with
Kanji or are extremely tricky to do. And how do you propose the
shell deal with Kanji? And should file names be allowed to have Chinese
characters in them???).

As I see it, the most straightforward solution to the 'internationalisation'
problem is to leave the programming and system utilities alone (that
also means not to put vertical bars into your logname...) and to provide
special purpose word-processors for word-processing in your favourite
natural language. 

This need not be a complete departure from the nroff/troff style
word processing (to which I am, incidentally, very attached), but
could be some extension to your favourite editor that deals with
translating nroff escape sequences back and forth from your terminal's
representation.

Even if you managed to cook up an operating system which were capable
of dealing with all kinds of Asian characters at all levels, nobody in the
western hemisphere would want to, or even could, run it. In addition,
it would still have to be able to communicate with all these old
fashioned things like ARPA, BITNET, System V, VT100's &c.

						Thomas.

edwards@uwmacc.UUCP (mark edwards) (10/15/85)

In article <527@talcott.UUCP> tmb@talcott.UUCP (Thomas M. Breuel) writes:
>In article <191@l5.uucp>, gnu@l5.uucp (John Gilmore) writes:
>> What will cause a LOT of grief is fitting the large Asian character
>> sets in.  I saw a memo purported to come from somewhere in AT&T that
>[...]
>> It's quite a job when you realize that unless ALL the Unix utilities
>> process Asian characters as characters, the system will lose.  Any
>
>What do you mean? Most UN*X utilities are programming utilities,
>and nobody is going to program in Chinese characters. And the demands of
>Chinese and Japanese word processing are so utterly different that
>a completely new kind of user interface and a completely new set of
>utilities is needed anyhow (sort, grep, &c don't really make sense with
>Kanji or are extremely tricky to do. And how do you propose the
>shell deal with Kanji? And should file names be allowed to have Chinese
>characters in them???).
>
>As I see it, the most straightforward solution to the 'internationalisation'
>problem is to leave the programming and system utilities alone (that
>also means not to put vertical bars into your logname...) and to provide
>special purpose word-processors for word-processing in your favourite
>natural language. 
>
>Even if you managed to cook up an operating system which were capable
>of dealing with all kinds of Asian characters at all levels, nobody in the
>western hemisphere would want to, or even could, run it. In addition,
>it would still have to be able to communicate with all these old
>fashioned things like ARPA, BITNET, System V, VT100's &c.
>
>						Thomas.


  It seems to me that this is just the problem. Look at our Big Automobile
companies. A few years ago with the fabricated oil shortages the 
JAPANESE were the only ones to see the value in small cars. Now they
have a good percentage of OUR (the U.S.) market. Look at the stereo
market, the TV, VCR, CAMERA, ETC.... There are more JAPANESE and 
CHINESE than all those who natively speak English. 
  This attitude will continue the ORIENTAL invasion of our markets. I
agree finding solutions to the CHINESE character sets is a very difficult
problem. But stopping the ORIENTAL ( JAPANESE, KOREAN, TAIWANESE, CHINESE
and others that might exist) invasion in native markets should transcend
typical thinking approaches. We the computer people have the ability
to do this.
   Has anyone seen a JAPANESE word processor in action? They have 4 types
of characters: Kanji (the Chinese characters), KANA (the KATAKANA and
HIRAGANA alphabets of sorts), and they use a fair amount of English (and
other foreign words) in their texts. Some of the important research 
for upcoming computer generations has to do with NATURAL LANGAUGE. Should
we just lay down and pass the wealth of our future generations over to the
EAST. 

   Another question: Don't these Unix Utilities output messages of some
sort in natural-language text? Would you have the rest of the world
learn and use English just because, WE the Americans (and other Western
countries) are so narrow minded that we will not consider other usages
of characters in our computers. COME ON !! This is net.international!! 

   After all, we Computer Scientists take the difficult problems, define
them and come up with viable solutions. Let's not pass off difficult
problems by just ignoring them. I can assure you the JAPANESE will not
because they can't.

							MARK
**********************************************************************

These views are solely my own and possibly reflect no one else's.

---
When given the choice of two evils, I always try the one I haven't tried
before.
		    -- MAE WEST

long@ittatc.ATC.ITT.UUCP (H. Morrow Long [Systems Center]) (10/17/85)

> other foreign words) in their texts. Some of the important research 
> for upcoming computer generations has to do with NATURAL LANGAUGE. Should
							   ^^^^^^^^^
> we just lay down and pass the wealth of our future generations over to the
> EAST. 
> 

	I agreed with the message of this article but NATURAL LANGUAGE
	should begin at home.

-- 

				H. Morrow Long
				ITT-ATC Systems Center,
				1 Research Drive Shelton, CT  06484
				Phone #: (203)-929-7341 x. 634
	
path = {allegra bunker ctcgrafx dcdvaxb dcdwest ucbvax!decvax duke ittral
	milford mit-eddie psuvax1 qumix sii supai tmmnet yale}!ittatc!long

tmb@talcott.UUCP (Thomas M. Breuel) (10/17/85)

In article <1558@uwmacc.UUCP>, edwards@uwmacc.UUCP (mark edwards) writes:
>   This attitude will continue the ORIENTAL invasion of our markets. I
> agree finding solutions to the CHINESE character sets is a very difficult
> problem. But stopping the ORIENTAL ( JAPANESE, KOREAN, TAIWANESE, CHINESE
> and others that might exist) invasion in native markets should transcend
> typical thinking approaches. We the computer people have the ability
> to do this.

I am not sure I understand how incorporating foreign (oriental)
character sets into an operating system can help stop
'the oriental invasion' (if such a thing exists at all).
Why don't you elaborate.

>    Another question: Don't these Unix Utilities output messages of some
> sort in natural-language text? Would you have the rest of the world
> learn and use English just because, WE the Americans (and other Western
> countries) are so narrow minded that we will not consider other usages
> of characters in our computers. COME ON !! This is net.international!! 

The rest of the world is learning and using English. Personally,
I think English is far from ideal for a universal language, but
it was established by historical accident and not conscious choice.

Any attempt at 'internationalising' UN*X is pretty much doomed
to fail. Likewise, any attempt at 'internationalising' programming
environments is doomed to fail. Symbols and identifiers in programming
languages are usually mnemonically chosen words or abbreviations.
In UN*X, the name of a user program is at the same time an
identifier in other programs (shell scripts), and its output
serves both as a user interface and as input for other programs.
This is one of the main strengths of the UN*X way of operating
system architecture (what? you mean it was designed???).

The only way to provide a user-friendly, nationalised interface
in UN*X is to write something which translates between the
UN*X names and identifiers and the language the user understands.
From personal experience, I can tell you, though, that most
foreigners prefer not to use such interfaces.

>    After all, we Computer Scientists take the difficult problems, define
> them and come up with viable solutions. Let's not pass off difficult
> problems by just ignoring them. I can assure you the JAPANESE will not
> because they can't.

The Japanese will not ignore the problem of how to represent their
language on their computers because they have to solve this problem
for their own good. If you are really into selling computers into the
Japanese market, then you should also concern yourself with this
problem. If you want to make competitive products for the American
market, you had better ignore it. The Chinese writing system
is a very special problem (for computers, not for people) and
demands a very special solution.

					Thomas.

gnu@l5.uucp (John Gilmore) (10/18/85)

In article <527@talcott.UUCP>, tmb@talcott.UUCP (Thomas M. Breuel) writes:
> What do you mean? Most UN*X utilities are programming utilities,
> and nobody is going to program in Chinese characters.

:-) Of course nobody in China would ever program in Chinese.  They'd just
learn English because it's the natural language for talking to computers.

> utilities is needed anyhow (sort, grep, &c don't really make sense with
> Kanji or are extremely tricky to do.

I don't claim to know how to do them, I just claim that in Japan
people will want to grep their text files, the same way we do.  And
they certainly do have a sorting order (if not more than one), as we do.

>                        And should file names be allowed to have Chinese
> characters in them???

Of course file names should have Chinese characters.  Why deny the
essential benefit of a file system (a way to organize data with names)
to the people who happen to speak and write in Chinese?  The file
system code really doesn't care what those bytes of name MEAN, it just
remembers name<->data correspondences.  (Certainly the code that
implements the file system and its utilities is currently making some
assumptions about file names, and those will need changing.  That's
what this group is for!)
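
The claim that the file system only remembers name<->data correspondences can
be tried directly. This is a small sketch, assuming a Unix-like system where
names may hold any bytes other than "/" and NUL; the helper name
store_and_list is made up, and the UTF-8 bytes of the sample name stand in
for whatever encoding a site would really use.

```python
import os
import tempfile


def store_and_list(name, payload):
    """Create a file whose name is the given raw bytes, then read it back."""
    with tempfile.TemporaryDirectory() as d:
        directory = os.fsencode(d)        # bytes path in, bytes names out
        with open(os.path.join(directory, name), "wb") as f:
            f.write(payload)
        names = os.listdir(directory)     # the names come back as raw bytes
        with open(os.path.join(directory, names[0]), "rb") as f:
            return names, f.read()


# The name is just a byte string to the file system; here it happens to be
# the bytes of two Chinese characters followed by ".txt".
name = "文件.txt".encode("utf-8")
names, data = store_and_list(name, b"hello")
print(names == [name], data)  # True b'hello'
```

The assumptions that do need changing live in the utilities around the file
system (listing, sorting, quoting, display), not in the lookup itself.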

> As I see it, the most straightforward solution to the 'internationalisation'
> problem is to leave the programming and system utilities alone (that
> also means not to put vertical bars into your logname...)

Given the choice of buying a system that lets me use my *name* as my
login name, or one that forbids it, other things equal I know which one
I will buy...or design, build and sell.

>							     and to provide
> special purpose word-processors for word-processing in your favourite
> natural language. 

These already exist and are not the subject of this newsgroup.  We're
talking about extending all the benefits of Unix (I presume you think
Unix is a nice environment to work and play in, yes?) to people who
speak and write differently than you do in Murray Hill.  The local-language
word processor problem is pretty well licked, though some of the solutions
(eg Japanese) still cost many yen more than an American word processor.

tmb@talcott.UUCP (Thomas M. Breuel) (10/18/85)

In article <198@l5.uucp>, gnu@l5.uucp (John Gilmore) writes:
> :-) Of course nobody in China would ever program in Chinese.  They'd just
> learn English because it's the natural language for talking to computers.

I think that our design of programming languages has been influenced
strongly by our natural languages. The question of lexical
analysis doesn't really make sense in Chinese, for example.

Of course you could design a programming language that uses
Chinese characters as its terminal symbols. Given how primitive
the vocabulary of programming languages is, and how unrelated
the 'mnemonic names' are to the real-life meanings of the words,
it is hardly worth it, though. 

> These [special purpose word processors] already exist 
> and are not the subject of this newsgroup.  We're
> talking about extending all the benefits of Unix (I presume you think
> Unix is a nice environment to work and play in, yes?) to people who
> speak and write differently than you do in Murray Hill.

As I posted before, I think it is impossible to implement an international
user interface at a low level in UN*X because one of the strengths of
UN*X is that the user interface is identical with an interactive
programming language. If you change the user interface (i.e. change
the name of 'grep' to something else), all shell scripts using 'grep'
will break. Remember what trouble it caused when someone decided that
the extra space in 'date' output was ugly and did away with it?
This reliance upon fixed output formats, fixed names, &c is not
bad programming, but a logical consequence of the UN*X philosophy.
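
The 'date' incident is easy to reproduce in miniature. The two sample lines
below are invented for illustration (they are not captured output from any
particular system), and time_by_column is a hypothetical stand-in for the
kind of fixed-column extraction a script might do.

```python
def time_by_column(date_line):
    """Trust the layout: the time of day is taken from columns 11 through 18."""
    return date_line[11:19]


padded = "Tue Oct  8 09:05:00 PDT 1985"    # day of month padded to two columns
unpadded = "Tue Oct 8 09:05:00 PDT 1985"   # the "ugly" extra space removed

print(repr(time_by_column(padded)))    # '09:05:00'
print(repr(time_by_column(unpadded)))  # '9:05:00 '
```

One character of cosmetic change shifts every later column, and any consumer
that relied on the old layout quietly gets the wrong field. Translating the
day and month names would disturb such consumers in the same way.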

Now, that doesn't mean that you couldn't cook up a shell for UN*X
that encodes Chinese characters in ASCII (for file names) and 
translates between Japanese commands and 'English' commands.
But once you change anything lower level than that, like the names of system
utilities or the output format of almost any program in UN*X, you
can forget about sharing software.
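
What such a shell-level encoding of file names might look like, as a sketch
only: the "=XX" hex escape below is a made-up scheme, the function names are
hypothetical, and UTF-8 is used merely as a convenient stand-in for whatever
multi-byte representation the native name would have.

```python
SAFE = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789._-")


def to_ascii_name(name):
    """Encode a native-language name as a plain ASCII file name."""
    out = []
    for byte in name.encode("utf-8"):
        ch = chr(byte)
        if ch in SAFE:
            out.append(ch)
        else:
            out.append("=%02X" % byte)  # escape everything else, "=" included
    return "".join(out)


def from_ascii_name(encoded):
    """Invert to_ascii_name so the shell can show the native name again."""
    raw = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == "=":
            raw.append(int(encoded[i + 1:i + 3], 16))
            i += 3
        else:
            raw.append(ord(encoded[i]))
            i += 1
    return raw.decode("utf-8")


stored = to_ascii_name("報告.txt")
print(stored.isascii(), from_ascii_name(stored) == "報告.txt")  # True True
```

The underlying utilities see only ASCII names, so existing software keeps
working, at the cost of names that are unreadable to anything that bypasses
the translating shell.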

Altogether:

-- I doubt that a hybrid system that understands all character sets,
   all string orders, all national date and time conventions, &c. has
   a chance in the west, because of the overhead and cost involved.
   Maybe a system that can handle just all Roman character set based
   languages has a chance, although I even doubt that...

-- File exchange, program exchange, networking, or any other kind of
   communication between machines with different character sets is
   a nightmare and very likely not to work. Just the difference in
   byte order between the VAX and the PDP is causing lots of grief
   already.
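
The byte-order grief can be shown concretely. This sketch assumes "the PDP"
means the PDP-11 convention for 32-bit longs (high 16-bit word first, each
word little-endian) as against the VAX's fully little-endian layout; the
function names are invented for the example.

```python
import struct


def vax_long(value):
    """32-bit integer as stored on a VAX: least significant byte first."""
    return struct.pack("<I", value)


def pdp11_long(value):
    """32-bit integer in the PDP-11 layout: high word first, words little-endian."""
    return struct.pack("<HH", value >> 16, value & 0xFFFF)


value = 0x0A0B0C0D
print(vax_long(value).hex())    # 0d0c0b0a
print(pdp11_long(value).hex())  # 0b0a0d0c

# A VAX reading the PDP-11 bytes without conversion sees a different number.
misread = struct.unpack("<I", pdp11_long(value))[0]
print(hex(misread))             # 0xc0d0a0b
```

Both machines agree on what the integer is and still disagree on the bytes,
which is the same shape of problem as two sites that agree on a character
and disagree on its encoding.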

Conclusions:

-- Before you start screwing around with UN*X, please make a backup
   copy so that there is still something working around when you are
   done (anyone have a copy of 4.1 :-). 

-- Of course, in an ideal world, everybody could sit down at his
   terminal, type to the computer in his natural language, and the
   computer would automatically do the rest. Now, I am not opposed
   to that idea, I would just like to hear more reasonable proposals
   of how to do it. And, honestly, I don't think that you can begin
   by hacking namei or by starting to put funny characters into
   your logname. If you are realistic, you have to come up with
   something that works on top of existing operating systems (shudder),
   if you are revolutionary, you have to present a completely new
   concept, but you can hardly call it UN*X anymore, as most of
   what makes UN*X a fast and efficient system to work with is
   intimately related to its data structure: the ASCII text file,
   composed of English alphabetic characters.

						Thomas.

ellis@spar.UUCP (Michael Ellis) (10/18/85)

>It seems to me that this is just the problem. Look at our Big Automobile
>companies. A few years ago with the fabricated oil shortages the 
>JAPANESE were the only ones to see the value in small cars.

    That viewpoint overlooks our own stupidity. We are to blame for our
    arrogant assumption that things would continue to favor the `American
    way' -- wasteful overconsumption and contemptuous misappraisal of
    foreign, especially noneuropean, nations. 
    
    Overbearing complacence is the deadliest symptom of the disease called
    being #1. 

>Now they have a good percentage of OUR (the U.S.) market. Look at the stereo
>market, the TV, VCR, CAMERA, ETC.... There are more JAPANESE and CHINESE
>than all those who natively speak English.  This attitude will continue the
>ORIENTAL invasion of our markets.

    Good for them!

    It only goes to show that the capitalist system might really work.
    
    The Chinese and Japanese are eager to understand our culture. How many
    Americans even care to learn the languages of our honorable competitors?
    
    Hubris provokes nemesis.

-michael

minow@decvax.UUCP (Martin Minow) (10/19/85)

Digital has sold a Japanese-language version of Ultrix (Unix) in Japan
for some time now.  It uses the VT80 Kanji terminal to display English,
Katakana (syllabic) or Kanji.  There are also Japanese-language versions
of other DEC operating systems.

IBM sells a Japanese-language version of the IBM-PC.

The VT80 terminal contains a built-in ROM with the most popular
Kanji representations.  It also has a RAM that can be down-line
loaded with other representations.  When the VT80 receives
an "unknown" Kanji, it sends an escape sequence to the host
computer -- which is interpreted by the terminal subsystem --
requesting a display representation of that character.
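
The ROM/RAM arrangement behaves like a two-level glyph cache with the host as
backing store. The sketch below models only that behaviour; it is not the
VT80's actual protocol, the escape sequence is replaced by a plain function
call, and every name in it is hypothetical.

```python
class Terminal:
    """Toy model: fixed ROM glyphs, downloadable RAM glyphs, host as fallback."""

    def __init__(self, rom, ask_host):
        self.rom = dict(rom)      # built-in glyphs, fixed at manufacture
        self.ram = {}             # glyphs down-line loaded from the host
        self.ask_host = ask_host  # stands in for the request escape sequence
        self.requests = []        # character codes the host had to supply

    def glyph_for(self, code):
        if code in self.rom:
            return self.rom[code]
        if code not in self.ram:
            self.requests.append(code)
            self.ram[code] = self.ask_host(code)  # loaded once, then reused
        return self.ram[code]


# The "glyphs" here are just labels; a real terminal would hold bitmaps.
host_font = {0x3021: "glyph-3021", 0x7E7E: "glyph-7E7E"}
term = Terminal({0x3021: "glyph-3021"}, host_font.__getitem__)

shown = [term.glyph_for(c) for c in (0x3021, 0x7E7E, 0x7E7E)]
print(shown)          # ['glyph-3021', 'glyph-7E7E', 'glyph-7E7E']
print(term.requests)  # [32382]
```

Only the first appearance of a rare character costs a round trip to the
host; later uses are served from the terminal's own RAM.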

Martin Minow
decvax!minow

eugene@ames.UUCP (Eugene Miya) (10/23/85)

> 
> >Now they have a good percentage of OUR (the U.S.) market. Look at the stereo
> >market, the TV, VCR, CAMERA, ETC.... There are more JAPANESE and CHINESE
> >than all those who natively speak English.  This attitude will continue the
> >ORIENTAL invasion of our markets.
> 

Permit me to make a comment about this statement.  This does not deal with
internationalization directly, but it does deal with discussing this topic.

Fortunately, our site had a copy of the parent article, because the
above can easily be misconstrued.  I am talking about the use of loaded
words like "ORIENTAL invasion."  I realize the speaker has quite a
respect, but his words can easily be taken out of context as his entire
text was not quoted.

Recently, I was returning from a meeting in Montreal and behind me sat
two members from a major military weapons facility (to go unnamed) who
attended the same meeting.  They were discussing the supercomputer
race with a non-meeting passenger on the plane, talking about how, if
the US did not keep up with the Japanese, US industry, defense,
everything would be at the mercy (too kind a word for genitals) of
the Japanese.  At which point, I turned around to join what I thought
would be interesting discussion.  The speaker promptly shut up.

If you are going to discuss invasions, let us not forget that WWII
hysteria drove my relatives into detention centers and created
witch hunts in the 1950s.  Do we begin by turning the US into a
totalitarian state because of economic hardship?  Perhaps, I should
quit working for the US government because I might be suspected of
being an economic spy.  Perhaps, you want all those descended from
the Far East to jump off cliffs to remove all doubt?
I rarely like to think about "the color of my skin," but recent
protectionist attitudes have my guard up.  I don't regard my comments
as a flame, but rather a defense of civil liberties.

To repeat, I don't regard the above as a personal attack, but
everyone discussing internationalisation had best put a good
foot forward (or I should say hand) as this group is read around the
world.

--eugene miya
  NASA Ames Research Center
  {hplabs,ihnp4,dual,hao,decwrl,allegra}!ames!aurora!eugene
  emiya@ames-vmsb