[net.unix] International Unix

keld@diku.UUCP (Keld J|rn Simonsen) (07/22/85)

<>

A while ago there was some discussion in these groups on
international UNIX. I missed it due to faults in our news system.
(I still wonder why). Here is my two cents worth.

The EUUG also has a Standards Committee on International
UNIX. We are looking forward to cooperating with the /usr/group/UK
group. There is a meeting on this in connection with the EUUG Copenhagen
Conference, scheduled for Thursday 12th September 1985.

As Leif Samuelson noted, some chars in what you think is ASCII,
but in reality is ISO 646-1983, are reserved for national use,
namely the twelve (12) chars:

             #$@[\]^`{|}~

Various European National Standardisation Boards have adopted
character representations different from ASCII on (in total)
all the abovenamed positions. So these should not be thought of
as generally useful for international software: any of these characters
will generate weird output in at least one major European area.

Yes, we need to be able to have variable names with these characters.
ANSI C does not allow this, but it allows a representation of nine of
the abovenamed chars in *trigraph* form: ?? is used as a lead-in to
define:

#       [       \       ]       ^       {       |       }       ~
??=     ??(     ??/     ??)     ??'     ??<     ??!     ??>     ??-       

$@` are not used (at the moment) in ANSI C.
Personally I do not like the choice of ? as lead-in char as it
is graphically quite dominating; maybe .. would have been better,
but the trigraph scheme is quite general and OK to me.
If we then could use the national chars in variable names, C could
become a quite useful programming language :-)
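
As a rough illustration of the trigraph table above, here is a minimal
C sketch of the replacement step. The function names (trigraph_char,
expand_trigraphs) are made up for this example and belong to no
standard library; a real compiler does this as an early translation
step, and the sketch ignores everything else that step involves.

```c
#include <stddef.h>

/* Map the third character of a trigraph to its replacement.
 * Returns 0 when the character does not complete one of the
 * nine trigraphs listed in the table above. */
char trigraph_char(char third)
{
    switch (third) {
    case '=':  return '#';
    case '(':  return '[';
    case '/':  return '\\';
    case ')':  return ']';
    case '\'': return '^';
    case '<':  return '{';
    case '!':  return '|';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Copy src to dst, replacing each trigraph with the single
 * character it stands for.  dst must have room for
 * strlen(src) + 1 bytes; the output is never longer than the input. */
void expand_trigraphs(const char *src, char *dst)
{
    while (*src != '\0') {
        char c;

        if (src[0] == '?' && src[1] == '?' &&
            (c = trigraph_char(src[2])) != 0) {
            *dst++ = c;     /* three input characters become one */
            src += 3;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}
```

Under these assumptions, the input text char foo??(100??); comes out
as char foo[100]; and a ?? followed by any other character is copied
through unchanged.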

foust@gumby.UUCP (07/22/85)

> Yes, we need to be able to have variable names with these characters.
> ANSI C does not allow this, but it allows a representation of nine of
> the abovenamed chars in *trigraph* form: ?? is used as a lead-in to
> define:
> 
> #       [       \       ]       ^       {       |       }       ~
> ??=     ??(     ??/     ??)     ??'     ??<     ??!     ??>     ??-       
> 
> $@` are not used (at the moment) in ANSI C.
> Personally I do not like the choice of ? as lead-in char as it
> is graphically quite dominating, maybe .. was better,
> but the trigraph scheme is quite general and OK to me.
> If we then could use the national chars in variable names, C could
> become a quite useful programming language :-)

But just think what this would do for an international
obfuscated C contest!  Anybody want to translate this year's entries?

-- 
----------
John Foust
"I used to be disgusted, but now I'm just amused"

minow@decvax.UUCP (Martin Minow) (07/23/85)

Keld Joern Simonsen suggests, probably with tongue in cheek,
that C would be a useful programming language if only European users
could use their full national character set in identifiers.

To my knowledge, no commercially available computer language --
including a few developed in Scandinavia such as Algol 60 (for
Trask and Besk), Algol-Genius (for the Datasaab machines) and
Simula (for DEC PDP-10s) -- permits national letters in variable
names, so the marketplace hasn't exactly mandated their inclusion.

I would also point out that national replacement character sets
are being superseded by the Draft ISO/ANSI/ECMA 8-bit character
set called Latin 1.  Latin 1 has a unique representation for the
national letters of the major European languages and, once the
initial problems of going from a seven-bit character set to
an eight-bit set have been solved, should prove to be a much
simpler representation to deal with for international products.

Martin Minow (fil.kand. Stockholms Universitet)
decvax!minow

levy@ttrdc.UUCP (Daniel R. Levy) (07/25/85)

What do you do about the punctuation marks [], which are used in C to denote
arrays?  Wouldn't they come out screwy in some international ASCII dialects?
Something like char foo ??(100??) or what?
-- 
 -------------------------------    Disclaimer:  The views contained herein are
|       dan levy | yvel nad      |  my own and are not at all those of my em-
|         an engihacker @        |  ployer, my pets, my plants, my boss, or the
| at&t computer systems division |  s.a. of any computer upon which I may hack.
|        skokie, illinois        |
|          "go for it"           |  Path: ..!ihnp4!ttrdc!levy
 --------------------------------     or: ..!ihnp4!iheds!ttbcad!levy

trb@masscomp.UUCP (Andy Tannenbaum) (07/25/85)

I don't think it's necessary to hack up languages to allow funny
(uhm, international...) characters as variable names.  It is
important, though, to allow international character sets in the user
interface.  This is a totally different problem, and it's a problem
that Hewlett Packard seems to be addressing.  Within the past year,
there have been several articles in the HP Journal which address the
issues involved in the international software marketplace.  For
example, Wilson and Shaw, "Designing Software for the International Market,"
HP Journal Sept 1984 is an overview.  There have also been articles
which discussed sorting, hyphenation, spelling correction, maintaining
multi-language prompt-string databases, date formats, etc.

By the way, the HP Journal is free from HP, and often contains
interesting and timely information on their products and engineering.
It's nice to see a free publication which isn't useless.
I think you can get put on the list by mailing to

	HP Journal
	3000 Hanover Street
	Palo Alto, CA 94304 USA


	Andy Tannenbaum   Masscomp  Westford, MA   (617) 692-6200 x274

zap@ttds.UUCP (Svante Lindahl) (07/25/85)

["For you, for you, for you, I came for you" -- Bruce Springsteen, "For you"]

In article <93@decvax.UUCP> minow@decvax.UUCP (Martin Minow) writes:
>Keld Joern Simonsen suggests, probably with tongue in cheek,
>that C would be a useful programming language if only European users
>could use their full national character set in identifiers.
>
>To my knowledge, no commercially available computer language --
>including a few developed in Scandinavia such as Algol 60 (for
>Trask and Besk), Algol-Genius (for the Datasaab machines) and
>Simula (for DEC PDP-10s) -- permits national letters in variable
>names, so the marketplace hasn't exactly mandated their inclusion.

The PDP-10 Simula compiler does allow the lowercase national
characters { (a w/ umlaut, :a), | (o w/ umlaut, :o) and }
(a with a circle on top, Oa).

>Martin Minow (fil.kand. Stockholms Universitet)
>decvax!minow

Svante Lindahl (fil.kand. Stockholms Universitet)

-- 
Svante Lindahl, NADA, KTH (Dept of Numerical Analysis and Computer Science 
			   at the Royal Institute of Technology)
UUCP:	{decvax,philabs,seismo}!{mcvax,ukc,unido}!enea!ttds!zap
ARPA:	mcvax!enea!ttds!zap@seismo.ARPA
or 	Svante_Lindahl_NADA%QZCOM.MAILNET@MIT-MULTICS.ARPA

rjh@ihlpa.UUCP (Randolph J. Herber) (07/31/85)

> >To my knowledge, no commercially available computer language --
> >including a few developed in Scandinavia such as Algol 60 (for
> >Trask and Besk), Algol-Genius (for the Datasaab machines) and
> >Simula (for DEC PDP-10s) -- permits national letters in variable
> >names, so the marketplace hasn't exactly mandated their inclusion.
> (a with a circle on top, Oa).
> 
> >Martin Minow (fil.kand. Stockholms Universitet)
> >decvax!minow
> UUCP:	{decvax,philabs,seismo}!{mcvax,ukc,unido}!enea!ttds!zap

IBM PL/I does allow three "national alphabet" characters in variable
names: $ (dollar sign), @ (at sign), and # (pound or number sign).

Randolph J. Herber, Amdahl Senior Systems Engineer, 
   at AT&T Bell Labs, Naperville, IL, 312-979-6553
   or 800-843-7467 extension 1075

bilbo.jbrown@ucla-locus.ARPA (Jordan Brown) (10/24/85)

A couple of notes on the message from Erik Fair (ucbvax!fair):

Unfortunately, you CAN'T build a good international character set.
Some of those silly European countries have the same character in
several languages, but sort the character in different places in each
language.  They also have interesting constructs like characters that
sort as two characters, and pairs of characters that sort as single
characters.  That is, there might be a character @ which sorts as "xy",
so that @m sorts right after xylophone and before xyn.  Similarly, they
sometimes say that the pair ll sorts as a single character; I don't
remember where.

Character set is not (or should not be) a very basic assumption.
Aren't there EBCDIC UNIXes out there?  Most of the system is (should be)
completely independent of the character set.  The only place you should
have problems will be programs which make assumptions about arithmetic
on characters, or about the range of values characters take on.
(Note that C promises that all characters in its basic character set are
non-negative; this is not to say that every possible value of a char
variable is non-negative, however.)
What characters does the kernel (for instance) know and care about?
Slash (/), Null (\0), and maybe Dot (.) in the main body of the kernel;
a few control characters in the tty drivers.  No big deal.

There will be work, but it shouldn't be too bad.

Much more grunt work is involved in isolating the messages for translation.
People writing code commercially should keep this in mind.  Keep your
messages in a separate module, or better yet in an external file.  Try to
make the code flexible about exactly how long messages are; the length will
vary dramatically when you translate the message, and English is usually
the most terse language.
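
One way to keep messages in a separate module, sketched in C under
stated assumptions: the identifiers (enum lang, enum msg_id, get_msg,
format_msg) are invented for this example, the German strings are only
illustrative (with "oe" standing in for o-umlaut), and a real product
would more likely load the table from an external file. The caller
passes the buffer size, so the code makes no assumption about how long
a translated message is.

```c
#include <stdio.h>

enum lang   { LANG_EN, LANG_DE, LANG_COUNT };
enum msg_id { MSG_CANNOT_OPEN, MSG_COPIED, MSG_COUNT };

/* All user-visible text lives in this one table, indexed by
 * language and message number.  Each message takes one %s. */
static const char *const messages[LANG_COUNT][MSG_COUNT] = {
    /* LANG_EN */ { "cannot open %s",                 "%s copied"  },
    /* LANG_DE */ { "%s kann nicht geoeffnet werden", "%s kopiert" },
};

/* Look up the format string for a message in a given language. */
const char *get_msg(enum lang lang, enum msg_id id)
{
    return messages[lang][id];
}

/* Format a message into buf (at most size bytes, always terminated
 * when size > 0).  Returns the length the full message needs, so the
 * caller can detect truncation by comparing it against size. */
int format_msg(char *buf, size_t size, enum lang lang,
               enum msg_id id, const char *arg)
{
    return snprintf(buf, size, get_msg(lang, id), arg);
}
```

Note that the German text for the first message is roughly twice as
long as the English one, and that the argument moves from the end of
the sentence to the start; both are reasons to keep the whole format
string, not fragments of it, in the table.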

Wouldn't it be easier to convince the Europeans to speak English? :-)

gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) (10/25/85)

Just a reminder:  international != European

fair@ucbarpa.BERKELEY.EDU (Erik E. Fair) (10/27/85)

The point about the kernel not having much ASCII-dependent code is well
taken; however, I was thinking of (and expounding upon) the whole of UNIX,
which most specifically includes the entire lot of ASCII-ridden utility
programs. For an inaccurate survey (it will grossly underestimate the
number of programs that will have to be changed), do something like
this:

	cd /usr/src
	egrep -l 'include.*ctype\.h' *.[ch] */*.[ch] | wc -l

This will give you the number of files that include `ctype.h', which
is explicitly ASCII-dependent.

Someone else said that different European languages (we're ignoring the
orient for the moment) use the same glyph or letter in different order,
or with completely different meaning, and therefore an international
character set can't be done.

If that's the case, then each country will end up writing its own
version of UNIX based on the national character set (like the French
have done, and the Japanese are doing now).

The real goal we're shooting for is the international exchange of
information. There's nothing stopping the Europeans or the Japanese
from building their own computers on incompatible character sets. And
they do so. However, it is hard, slow and tedious to translate data
from a Japanese computer in (say) kanji, to English/ASCII.

I think that we're attacking the wrong problem, however. Instead of
attempting the technological solution by teaching computers some large
`n' number of languages, we should attack the basal cultural problem by
developing and widely teaching a common intermediate language.
Esperanto, anyone?

	Erik E. Fair	ucbvax!fair	fair@ucbarpa.BERKELEY.EDU

sambo@ukma.UUCP (Father of micro-ln) (10/30/85)

In article <2400@brl-tgr.ARPA> bilbo.jbrown@ucla-locus.ARPA (Jordan Brown) writes:
>Unfortunately, you CAN'T build a good international character set.
>Some of those silly European countries have the same character in
>several languages, but sort the character in different places in each
>language.  They also have interesting constructs like characters that
>sort as two characters, and pairs of characters that sort as single
>characters.  That is, there might be a character @ which sorts as "xy",
>so that @m sorts right after xylophone and before xyn.  Similarly, they
>sometimes say that the pair ll sorts as a single character; I don't
>remember where.
>

I guess I would like to see some examples of the above.  Are you saying
that in some language, the order of the letters might be "a b c ...",
whereas in some other language, the order might be "a c b ..."?  What
pair of languages is like this?  Also, in which language is some single
character considered as two characters?

I speak Spanish and some French.  Without thinking very much, something
like the double "l" (which at least in Honduras is pronounced the same
as a "y") would need to be treated as a single character, but written
out as two characters.  The problem is in capitalizing it.  There need
to be two forms for the uppercase double "l": "LL" and "Ll".  This would
mean that there would be two different codes for the uppercase double
"l".  Again, without thinking very much, this is the same situation as
with vowels, since they may have an accent.

Disclaimer: I am not an expert on International Unix.
--
Samuel A. Figueroa, Dept. of CS, Univ. of KY, Lexington, KY  40506-0027
ARPA: ukma!sambo<@ANL-MCS>, or sambo%ukma.uucp@anl-mcs.arpa,
      or even anlams!ukma!sambo@ucbvax.arpa
UUCP: {ucbvax,unmvax,boulder,oddjob}!anlams!ukma!sambo,
      or cbosgd!ukma!sambo

	"Micro-ln is great, if only people would start using it."

bilbo.jbrown@ucla-locus.ARPA (Jordan Brown) (10/30/85)

> gwyn@brl:
> international != European

true, but European is a subset of international.

> ucbvax!fair:
> grep 'ctype.h' *
> finds ASCII-dependent programs

Not true at all.  "isalpha", "isupper", and most of the others are explicitly
NOT ASCII dependent.  They exist to allow independence from ASCII.  Sure, they
are implemented in an ASCII-dependent way, but if you want to change the
charset, all you need to do is change ctype.h and the library routine(s) (if
any).  In fact, for one of the implementations of ctype.h, all you need to
do is to change a table of character types.
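
A minimal sketch of the table-driven approach described above. This is
not the source of any real ctype.h; the names (ct_table, ct_mark,
ct_init_english, ct_init_swedish, ct_isalpha, ct_islower) are invented
for illustration. The Swedish variant treats { | } as lowercase
letters, as mentioned earlier in this thread for the PDP-10 Simula
compiler; uppercase national letters are left out to keep it short.

```c
#include <limits.h>
#include <string.h>

#define CT_UPPER 01
#define CT_LOWER 02

/* One entry per possible unsigned char value. */
static unsigned char ct_table[UCHAR_MAX + 1];

/* Set the given class bits for every character in chars. */
void ct_mark(const char *chars, unsigned char bits)
{
    for (; *chars != '\0'; chars++)
        ct_table[(unsigned char)*chars] |= bits;
}

/* Letters are listed explicitly, so nothing here depends on
 * their codes being contiguous or in any particular order. */
void ct_init_english(void)
{
    memset(ct_table, 0, sizeof ct_table);
    ct_mark("ABCDEFGHIJKLMNOPQRSTUVWXYZ", CT_UPPER);
    ct_mark("abcdefghijklmnopqrstuvwxyz", CT_LOWER);
}

/* Assumption for this sketch: a 7-bit Swedish character set in
 * which the codes for { | } display as lowercase national letters. */
void ct_init_swedish(void)
{
    ct_init_english();
    ct_mark("{|}", CT_LOWER);
}

int ct_isalpha(int c)
{
    return (ct_table[(unsigned char)c] & (CT_UPPER | CT_LOWER)) != 0;
}

int ct_islower(int c)
{
    return (ct_table[(unsigned char)c] & CT_LOWER) != 0;
}
```

Programs that call ct_isalpha() need no change when the character set
changes; only the table initialisation does.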

piet@mcvax.UUCP (Piet Beertema) (10/30/85)

	>Wouldn't it be easier to convince the Europeans to speak English? :-)
Far easier would it be to get all Americans to speak Dutch... :-)

-- 
	Piet Beertema, CWI, Amsterdam
	(piet@mcvax.UUCP)

andy@cheviot.uucp (Andy Linton) (10/31/85)

In article <864@mcvax.UUCP> piet@mcvax.UUCP (Piet Beertema) writes:
>
>	>Wouldn't it be easier to convince the Europeans to speak English? :-)
>Far easier would it be to get all Americans to speak Dutch... :-)
>
I agree with piet but....
Wouldn't Gaelic be a better choice - hardly anyone knows any so
we all start out equal (I have already started). There are only
sixteen characters in the alphabet, with two extra symbols: the
sineadh fada (it looks like the French acute accent) and the
inclusion of the letter 'h' to indicate aspiration of the preceding
letter. We may even be able to reduce the number of bits in a byte!
All the Americans who claim Irish or Scottish extraction
will have an inherent ability to master this as it is part of
their unconscious folk heritage(:-).
I don't want my culture (Anglo-Irish) swamped by the American
one any more than the rest of the Europeans do. After all
'Live the difference' doesn't have the same ring to it in
English as in French.

Slainte mhaith,
Andy

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SENDER 	: Aindrias Mac Giolla Fhionntain	PHONE	: +44 632 329233
POST	: Computing Lab, University of Newcastle upon Tyne, UK, NE1 7RU
ARPA	: andy%cheviot.newcastle.ac.uk@ucl-cs.ARPA
JANET	: andy@uk.ac.newcastle.cheviot
UUCP	: <UK>!ukc!cheviot!andy

***  Ni fui moran beagan d'aon rud, ach is fui moran beagan ceille.  ***

ds@warwick.UUCP (Douglas Spencer) (10/31/85)

>>Wouldn't it be easier to convince the Europeans to speak English? :-)
>Far easier would it be to get all Americans to speak Dutch... :-)
That's easier than teaching the Americans to speak *English*.. :-)
-- 
+-------------------------+-------------------------------+
| Douglas Spencer         | ..seismo!mcvax!ukc!warwick!ds |
|  Mathematics Institute  +-------------------------------+
|   University of Warwick | 'Stop youth! your thinking is |
|    Coventry             | muddy, turbid and confused !' |
|     CV4 7AL             | Who said it, what story ?     |
|      England            | please email, big prizes      |
+-------------------------+-------------------------------+

radzy@calma.UUCP (Tim Radzykewycz) (10/31/85)

In article <2344@ukma.UUCP> sambo@ukma.UUCP (Father of micro-ln) writes:
>In article <2400@brl-tgr.ARPA> bilbo.jbrown@ucla-locus.ARPA (Jordan Brown) writes:
>>Unfortunately, you CAN'T build a good international character set.
>>Some of those silly European countries have the same character in
>>several languages, but sort the character in different places in each
>>language.  They also have interesting constructs like characters that
>>sort as two characters, and pairs of characters that sort as single
>>characters.  That is, there might be a character @ which sorts as "xy",
>>so that @m sorts right after xylophone and before xyn.  Similarly, they
>>sometimes say that the pair ll sorts as a single character; I don't
>>remember where.

>I guess I would like to see some examples of the above.  Are you saying
>that in some language, the order of the letters might be "a b c ...",
>whereas in some other language, the order might be "a c b ..."?  What
>pair of languages is like this?  Also, in which language is some single
>character considered as two characters?

Basically, yes.  That's the general idea.  If you go through your archives
for net.nlang for about the last 3 or 4 weeks, you can get about 6 examples
of alphabets, at least two of which have "letters out of sequence".

One other way of looking at this (let's see how far ahead of myself
I can get) is to think of the reasons for the international character
set:
    1. consistent sorting
    2. consistent pred/succ operations
    3. no special characters in one language that are printable chars in another

Well, reason 2 says we can't have gaps in the letters for *any* language.
Reason 3 says languages with smaller alphabets can't use the extra chars.
Reason 1 says everything has to be in order.

So let's take a look at 3 character sets (English, Spanish and German):
a b c d e f g h i j k l  m n o p q r s t u v w x y z	<- english
a b c d e f g h i j k l ll m n o p q r s t u v w x y z	<- spanish
a b c d e f g h i j k l  m n o p q r s B t u v w x y z	<- german

(pardon me if any of this is wrong, but at least it makes the point,
even if it *is* wrong.)

So the letters (E:m-z,S:ll-z,G:m-z) are all different, and we're still
on the latin alphabet (How about cyrillic?).  

Aside:  I strongly recommend that anyone seriously interested in
international [issues|unix] read net.nlang.  It is not too difficult
to cull the garbage from it and read only the relevant articles, such
as the ones I mentioned above.  Please send flames to /dev/null and
discussions to me or the net.

>I speak Spanish and some French.  Without thinking very much, something
>like the double "l" (which at least in Honduras is pronounced the same
>as a "y") would need to be treated as a single character, but written
>out as two characters.  The problem is in capitalizing it.  There need
>to be two forms for the uppercase double "l": "LL" and "Ll".  This would
>mean that there would be two different codes for the uppercase double
>"l".  Again, without thinking very much, this is the same situation as
>with vowels, since they may have an accent.

I assume this is all an argument to support the original article; however,
I don't think that was clear the way it was written.
-- 
Tim (radzy) Radzykewycz, The Incredible Radical Cabbage
	calma!radzy@ucbvax.ARPA
	{ucbvax,sun,csd-gould}!calma!radzy

radzy@calma.UUCP (Tim Radzykewycz) (11/01/85)

In article <864@mcvax.UUCP> piet@mcvax.UUCP (Piet Beertema) writes:
>
>>Wouldn't it be easier to convince the Europeans to speak English? :-)
>Far easier would it be to get all Americans to speak Dutch... :-)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>	Piet Beertema, CWI, Amsterdam
>	(piet@mcvax.UUCP)

True.  Look at what attempts to be English.  :-) :-)

How about:

Possible to get all Europeans to speak English, but those
Americans would never go for it.  :-) :-)

Maybe:

Why get the Europeans to speak English?  We never talk to
them anyway.  :-} :-}
-- 
Tim (radzy) Radzykewycz, The Incredible Radical Cabbage
	calma!radzy@ucbvax.ARPA
	{ucbvax,sun,csd-gould}!calma!radzy

gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) (11/01/85)

> > international != European
> 
> true, but European is a subset of international.

So are Japanese and Chinese.  How are you going to
fix that by playing with the ASCII character set?

pete@kvvax4.UUCP (Peter J Story) (11/01/85)

In article <> sambo@ukma.UUCP (Father of micro-ln) writes:
  >that in some language, the order of the letters might be "a b c ...",
  >whereas in some other language, the order might be "a c b ..."?
  >What pair of languages is like this? 
Norwegian and Swedish, which have three extra characters that you can't
represent on your terminal but that on mine use the ASCII positions {|},
in an order that depends on the language.  In Norwegian the order is as
given above.  In Swedish it is }{|.  And then there are Danish and
Finnish, which I don't know offhand.
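
A hedged C sketch of what "same characters, different order" means for
a comparison routine. The names coll_rank and coll_compare are
hypothetical, and the sketch assumes an ASCII-like code in which 'z'
is the last plain letter. The language is described only by the order
of its three national letters: "{|}" for Norwegian and "}{|" for
Swedish, as stated above.

```c
#include <string.h>

/* Rank of a character for sorting.  The national letters are
 * ranked after 'z' in the order given by the string `national`;
 * every other character keeps its own code as its rank. */
int coll_rank(int c, const char *national)
{
    const char *p = (c != '\0') ? strchr(national, c) : NULL;

    if (p != NULL)
        return 'z' + 1 + (int)(p - national);
    return c;
}

/* Compare two strings by rank.  Returns a negative value, zero,
 * or a positive value, in the manner of strcmp(). */
int coll_compare(const char *a, const char *b, const char *national)
{
    while (*a != '\0' && *b != '\0') {
        int ra = coll_rank((unsigned char)*a, national);
        int rb = coll_rank((unsigned char)*b, national);

        if (ra != rb)
            return ra < rb ? -1 : 1;
        a++;
        b++;
    }
    if (*a == *b)               /* both strings ended together */
        return 0;
    return *a == '\0' ? -1 : 1; /* the shorter string sorts first */
}
```

With "{|}" as the national order, { sorts before }; with "}{|" the
same two characters sort the other way round, which is why a single
fixed code order cannot serve both languages.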

  >Also, in which language is some single character considered as two
How about the German character that looks like a beta, which is "ss" in the
nearest transliteration.  Or u with an umlaut diacritical mark, which at
least in some historical texts must sort as if it were ue.  Unless someone
in Germany corrects my too old knowledge.
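
The "one character sorts as two" case can be handled by building a
sort key before comparing. The following C sketch makes a loud
assumption: it supposes a 7-bit German character set in which the code
for ~ displays as the beta-like letter, and expands it to "ss". The
function name sortkey_de is invented for this example.

```c
#include <stddef.h>

/* Build a sort key from src in dst, expanding '~' (assumed here
 * to be the German sharp s) to "ss".  At most n bytes are written,
 * including the terminating NUL.  Returns the length the full key
 * needs, so a result >= n means the key was truncated. */
size_t sortkey_de(const char *src, char *dst, size_t n)
{
    size_t len = 0;

    for (; *src != '\0'; src++) {
        char one[2];
        const char *rep;

        if (*src == '~') {
            rep = "ss";         /* one character, two sort units */
        } else {
            one[0] = *src;
            one[1] = '\0';
            rep = one;
        }
        for (; *rep != '\0'; rep++) {
            if (len + 1 < n)
                dst[len] = *rep;
            len++;
        }
    }
    if (n > 0)
        dst[len < n ? len : n - 1] = '\0';
    return len;
}
```

Two strings are then ordered by comparing their keys, for instance
with strcmp(), so that Stra~e and Strasse compare as equal.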
-- 
Pete Story	      {decvax,philabs}!mcvax!kvport!kvvax4!pete
A/S Kongsberg Vaapenfabrikk, PO Box 25, N3601 Kongsberg, Norway
Tel:  + 47 3 739644   Tlx:  71491 vaapn n

dave@ecrcvax.UUCP (David Morton) (11/01/85)

References: <2400@brl-tgr.ARPA> <mcvax.864>
Reply-To: dave@ecrcvax.UUCP (David Morton)
Organization: European Computer-Industry Research Centre, Munchen, W. Germany

>       >Wouldn't it be easier to convince the Europeans to speak English? :-)
>Far easier would it be to get all Americans to speak Dutch... :-)
>
       You must be joking, they still cannot speak Gringlish properly :-)
-- 

Dave Morton
Tel. (49) 89 - 92699 - 139

CSNET: dave%ecrcvax.uucp@germany.csnet
UUCP: decvax!mcvax!unido!ecrcvax!dave

rcd@opus.UUCP (Dick Dunn) (11/04/85)

>>>Wouldn't it be easier to convince the Europeans to speak English? :-)
>>Far easier would it be to get all Americans to speak Dutch... :-)
>That's easier than teaching the Americans to speak *English*.. :-)
If you guys keep it up, we'll bring the Australians into this just so that
we Americans will have someone to pick on...  :-)
-- 
Dick Dunn	{hao,ucbvax,allegra}!nbires!rcd		(303)444-5710 x3086
   ...Never attribute to malice what can be adequately explained by stupidity.

zben@umd5.UUCP (11/05/85)

>>>Wouldn't it be easier to convince the Europeans to speak English? :-)
>>Far easier would it be to get all Americans to speak Dutch... :-)
>That's easier than teaching the Americans to speak *English*.. :-)

Why don't we all learn Hebrew - that way if G*d does show up, we will be
able to talk to him(?).  Does anybody remember the control string to make
a Heathkit go into right-to-left mode?
-- 
Ben Cranston  ...{seismo!umcp-cs,ihnp4!rlgvax}!cvl!umd5!zben  zben@umd2.ARPA

zben@umd5.UUCP (11/05/85)

>>>Wouldn't it be easier to convince the Europeans to speak English? :-)
>>Far easier would it be to get all Americans to speak Dutch... :-)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>True.  Look at what attempts to be English.  :-) :-)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The really interesting thing is that it makes perfect sense to a literate
American reader.  (I know, roger, null set :-)  One feels a sense of oddness
while reading them, but the meaning is certainly clear enough.

The first would be perfect American English if written in the negative:

>>Wouldn't it be far easier to get <random predicate>

The second implies to me that the referent is actually alive and attempting
to pass itself off somehow!  This might be something like the English idiom
"on a plane" giving foreign readers a mental picture of riding on the
*outside* of the fuselage of the plane, or of asking for "the milk" to mean
"give me all the milk in creation".

Not to mention "Throw your father down the stairs his hat!"  :-)

-- 
Ben Cranston  ...{seismo!umcp-cs,ihnp4!rlgvax}!cvl!umd5!zben  zben@umd2.ARPA

khasin@hcrvx2.UUCP (Khasin Teow) (11/06/85)

this topic should have some interesting issues to discuss, and instead
has deteriorated into a flurry of smart-alec comments!
i can't believe this! this is net.unix, not net.smartcomments.
there has been talk about high phone bills for netnews and my
company is talking about either cutting off the news or reducing the
number of newsgroups.
if these stupid comments keep pouring in, i can see that there is no
way i can support this newsgroup. (flame off)

sorry folks, although i could purge all articles under this subject
easily, i didn't want to miss some "real" articles on this subject.