[comp.emacs] US PC programmers still live in a 7-bit world!

newsuser@LTH.Se (Lund Institute of Technology news server) (06/23/88)

US PC programmers still live in a 7-bit world!

But we don't!
_____________


Yes, there  i s  intelligent life outside the USA. We even live
in an 8-bit world, which must come as a shocking piece of news
to some of you.

All right, you get a lot of credit for producing lovely programs
like uEmacs, Picnix, Tcless, and the like. They are lovely, yes, but
just to a certain extent, because they are completely useless
in Europe!

Why? Well, we are sure the intelligent reader already grasps the
reason. Take a look at the IBM PC character code set  a b o v e
ASCII 127. Our alphabet is there, too, and you just can't imagine
what funny results your tools yield when encountering them.

For instance, these are letters:

        Lower case      Upper case  |   Lower case      Upper case
        129             154         |   145             146
        130             144         |   147              -
        131              -          |   148             153
        132             142         |   149              -
        133              -          |   150              -
        134             143         |   151              -
        135             128         |   152              -
        136              -          |   160              -
        137              -          |   161              -
        138              -          |   162              -
        139              -          |   163              -
        140              -          |   164             165
        141              -          |

So, if your pet program is to become our pet, too, you have
to rethink concerning using the 8th bit as a flag, you have
to rewrite toupper, tolower, word scan, delete word, word
counters and the like.
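
As an illustration of the kind of rewrite meant here, the following is a minimal sketch of an 8-bit-aware word counter in C. The function names are hypothetical, and the letter ranges (128-154 and 160-165) are simply the span covered by the table above; another character set would need other ranges.

```c
#include <stddef.h>

/* Hypothetical helper: nonzero if c is a letter in the IBM PC set.
 * The parameter is unsigned char so codes above 127 never go negative. */
int pc_isletter(unsigned char c)
{
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
        return 1;
    /* Accented letters: the ranges spanned by the table above. */
    return (c >= 128 && c <= 154) || (c >= 160 && c <= 165);
}

/* Count words in buf[0..len), where a word is a run of letters. */
size_t pc_count_words(const char *buf, size_t len)
{
    size_t words = 0;
    size_t i;
    int in_word = 0;

    for (i = 0; i < len; i++) {
        /* Go through unsigned char; the 8th bit is part of the code. */
        int letter = pc_isletter((unsigned char)buf[i]);
        if (letter && !in_word)
            words++;
        in_word = letter;
    }
    return words;
}
```

With this, the twelve bytes of "sm\x94rg\x86s bord" count as two words. A counter that only knows A-Z and a-z would split the first word at codes 148 and 134 and report four.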

Then you, as well, will discover the joys in the real, 8-bit
world!

Have fun.

-- 
Torsten Olsson, Dept of Comp Sc, Lund University, Box 118, S221 00 Lund, Sweden
Phone: +46-46104930 (work), +46-46126768 (home)    Bitnet: lthlib@seldc52
Internet: torsten@dna.lth.se   or   torsten%dna.lth.se@uunet.uu.net
UUCP: {uunet,mcvax}!enea!dna.lth.se!torsten

bobc@killer.UUCP (Bob Calbridge) (06/25/88)

In article <1988Jun22.223158.1366@LTH.Se>, newsuser@LTH.Se (Lund Institute of Technology news server) writes:

> Why? Well, we are sure the intelligent reader already grasps the
> reason. Take a look at the IBM PC character code set  a b o v e
> ASCII 127. Our alphabet is there, too, and you just can't imagine
> what funny results your tools yield when encountering them.
> So, if your pet program is to become our pet, too, you have
> to rethink concerning using the 8th bit as a flag, you have
> to rewrite toupper, tolower, word scan, delete word, word
> counters and the like.

:-) But then that's why, as you included in your missive, it's
called ASCII.  The A stands for American.  Is there
such a thing as ISCII???

I can just imagine the loops and whirls you would have to
go through to write a viable sort routine to include new
alpha sorts.  And who's to determine what the order of sort
would be.  I'll see ya at the international conference where
we can hash this out.  :-]

Just for fun.
Bob

wnp@dcs.UUCP (Wolf N. Paul) (06/25/88)

In article <1988Jun22.223158.1366@LTH.Se> torsten@DNA.LTH.Se (Torsten Olsson) writes:
>US PC programmers still live in a 7-bit world!
>
>But we don't!
>
>Yes, there  i s  intelligent life outside the USA. We even live
>in an 8-bit world, which must come as a shocking piece of news
>to some of you.
>Take a look at the IBM PC character code set  a b o v e
>ASCII 127. Our alphabet is there, too, and you just can't imagine
>what funny results your tools yield when encountering them.
>
>So, if your pet program is to become our pet, too, you have
>to rethink concerning using the 8th bit as a flag, you have
>to rewrite toupper, tolower, word scan, delete word, word
>counters and the like.

As a native German-speaker I am aware of the problem Torsten refers to.
However, it is a bit more complex than his posting implies.

The C functions toupper()/tolower() rely on upper and lower case to be
two parallel groups of consecutive codes within the ASCII scheme. The way
IBM has chosen to implement non-English characters on its PC line is

(a) non-standard (i.e. applies only to IBM-compatible machines) and

(b) incompatible with the assumption about upper and lower case in ASCII
	and thus in C and other programming languages.

(c) incompatible with the way European characters are implemented in MOST
	printers and ANSI terminals.

The way IBM implemented it, all case functions would have to be table-driven,
which is much less elegant than working with the parallel ranges of characters
in standard ASCII.

So all of you Europeans should lobby hardware manufacturers to implement
foreign characters in an intelligent way, and in a STANDARD WAY across
different architectures, and THEN you can reasonably expect the authors
of compilers and libraries and tools to support these characters.

-- 
Wolf N. Paul * 3387 Sam Rayburn Run * Carrollton TX 75007 * (214) 306-9101
UUCP:     killer!dcs!wnp                 ESL: 62832882
DOMAIN:   wnp@dcs.UUCP                   TLX: 910-380-0585 EES PLANO UD

caf@omen.UUCP (Chuck Forsberg WA7KGX) (06/26/88)

In article <1988Jun22.223158.1366@LTH.Se> torsten@DNA.LTH.Se (Torsten Olsson) writes:
:US PC programmers still live in a 7-bit world!
:
:But we don't!
:_____________
:
:
:Yes, there  i s  intelligent life outside the USA. We even live
:in an 8-bit world, which must come as a shocking piece of news
:to some of you.

Very strange, since the requests for ZMODEM extensions to deal with
7 bit transmission paths come mostly from European PSVAN users.  The
major US PSVANS (Telenet, Tymnet, CIS, etc.) all support 8 bits!


Chuck Forsberg WA7KGX          ...!tektronix!reed!omen!caf 
Author of YMODEM, ZMODEM, Professional-YAM, ZCOMM, and DSZ
  Omen Technology Inc    "The High Reliability Software"
17505-V NW Sauvie IS RD   Portland OR 97231   503-621-3406
TeleGodzilla BBS: 621-3746   CIS: 70007,2304    Genie: CAF

P.S.: ZCOMM and Professional-YAM do support 8 bit character sets.

tj@gpu.utcs.toronto.edu (Terry Jones) (06/27/88)

There is a difference between communicating and character sets. Read the
whole message before posting a reply like that.

neitzel@infbs.UUCP (Martin Neitzel) (06/28/88)

(Here comes yet another one of those European bastards... :-)

torsten@DNA.LTH.Se (Torsten Olsson) writes:
TO>	US PC programmers still live in a 7-bit world!
TO>	[...]

In article <126@dcs.UUCP> wnp@dcs.UUCP (Wolf N. Paul) replied:
WNP>	[...]
WNP>	The C functions toupper()/tolower() rely on upper and lower case to be
WNP>	two parallel groups of consecutive codes within the ASCII scheme.

Basically Right.  But one of the reasons for those macros/functions
is to be independent of the character set in use, or am I completely
wrong?  (I know, the UNIX(tm) <ctype.h> requires us to write something
like "if (isascii(c) && islower(c) ...", but that should be relatively
easy to port into an 8-bit environment.)  On the other hand, if someone
thinks:  "Hey, 7 bits in my char for the ascii code, now let's see what
I can mess around with the 8th!" -- that's neither portable nor justified
by K&R.
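
One concrete trap when moving such tests to 8-bit data: where plain char is signed, a code above 127 becomes negative, and handing a negative value other than EOF to the <ctype.h> functions is undefined. A sketch of the usual guard follows; the wrapper names are made up for illustration.

```c
#include <ctype.h>

/* Hypothetical wrapper: classify a plain char safely.
 * The cast maps every char value into 0..UCHAR_MAX, the range
 * the <ctype.h> functions are defined for. */
int safe_islower(char c)
{
    return islower((unsigned char)c) != 0;
}

/* Uppercase a NUL-terminated string in place, using the rules of
 * the current locale. Bytes the locale does not know as lowercase
 * letters are left alone. */
void upcase_string(char *s)
{
    for (; *s != '\0'; s++)
        *s = (char)toupper((unsigned char)*s);
}
```

In the default "C" locale a byte such as 0x84 passes through upcase_string() untouched rather than being mangled, which is at least the harmless behaviour.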

[This is the point where I should mention that discussions about
 our European character sets are usually somewhat emotionally
 heated  (probably because we always had kind of to "suffer"
 from ASCII here in Europe).
 None of the following replies are intended to attack Wolf Paul
 personally.  But some points should be made clear.
]

WNP>	The way	IBM has chosen to implement non-English characters on its
WNP>	PC line is
WNP>	(a) non-standard (i.e. applies only to IBM-compatible machines) and

Then perhaps we can move to ISO Latin-X.  That would still confront us
with the same problems:  non-ASCII, 8-bit.  (For more on the standards
issue, see below.)

WNP>	(b) incompatible with the assumption about upper and lower case in ASCII
WNP>	    and thus in C and other programming languages.

K&R did not make any assumptions about required character sets, except
there must be at least one positive element contained.  So far regarding
"C & ASCII".

WNP>	(c) incompatible with the way European characters are implemented
WNP>	    on MOST printers and ANSI terminals.

Perhaps I should explain what "this way" was: The ascii characters like
[]{}\~ were considered as "not so useful for Europeans" and their codes
were interpreted as national characters.  That was a HACK, not a
solution!  Yes, usage of those ascii characters and national characters
was mutually exclusive.  Not "both of them" in one document, listen?

[btw, that's one of the reasons why I write in English when programming,
 and not in my native language (and the language the people that want to
 read my programs here would like to read.)]

Wolf is right: IBM did not constrain itself to any standard when
introducing their PCs.  But they made a first step into a reasonable
direction:
(1) Keeping 0-127 ASCII, and
(2) Providing most of the Europeans with "their" characters.

They certainly broke no previous standard.  The previous practice
as explained above was not a standard.  As I said, it was a hack, and
nobody ever before did care about finding a solution.


WNP>	The way IBM implemented it, all case functions would have to be
WNP>	table-driven, which is much less elegant than working with the
WNP>	parallel ranges of characters in standard ASCII.

Why is the table-driven approach "much less elegant"?  A syntax-table
ala GNU's is easy to implement, costs little memory, simplifies and speeds
up the code, offers classifications beyond "letter/upper/lower" in a
simple way, and is easy to re-configure at run time. (I think, for an editor
a table-driven implementation is the only way to go.  But let's keep to
the 8-bit issue.  [Btw, the character classification problem is
one of Jon Bentley's primary examples for data-driven programming
in his "Programming Pearls".])
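
A minimal sketch of such a syntax table in C, to make the argument concrete. This is not GNU Emacs's actual data structure; the class names and functions are invented for the example.

```c
#include <string.h>

/* Character classes, one bit each (names are illustrative). */
enum {
    CL_UPPER = 1 << 0,
    CL_LOWER = 1 << 1,
    CL_DIGIT = 1 << 2,
    CL_WORD  = 1 << 3   /* may appear inside a word */
};

/* One entry per possible 8-bit code. */
static unsigned char syntax[256];

/* Add the given class bits to every code in [first, last]. */
void syntax_set(int first, int last, unsigned char bits)
{
    int c;
    for (c = first; c <= last; c++)
        syntax[c] |= bits;
}

/* Start from plain ASCII. */
void syntax_init_ascii(void)
{
    memset(syntax, 0, sizeof syntax);
    syntax_set('A', 'Z', CL_UPPER | CL_WORD);
    syntax_set('a', 'z', CL_LOWER | CL_WORD);
    syntax_set('0', '9', CL_DIGIT | CL_WORD);
}

/* Lookup is a single array index, with no range comparisons. */
int syntax_has(unsigned char c, unsigned char bits)
{
    return (syntax[c] & bits) != 0;
}
```

Reconfiguring at run time is then one call: syntax_set(0x84, 0x84, CL_LOWER | CL_WORD) teaches the table that IBM code 132 is a lowercase word character, and syntax_set('_', '_', CL_WORD) does the same for underscores in a programming mode.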

WNP>	So all of you Europeans should lobby hardware manufacturers to
WNP>	implement foreign characters in an intelligent way, and in a
WNP>	STANDARD WAY across different architectures, and THEN you can
WNP>	reasonably expect the authors of compilers and libraries and
WNP>	tools to support these characters. 

The Intelligent Way will be some ascii extended to eight bits.

It may be hard to accept for Americans that ASCII is not
sufficient as an international code for information interchange
(and information processing).  A change has to be made and that
may hurt one or the other, resulting in emotionally heated
reactions on the American side.  At least ASCII has been proven
to be useful for some decades now, and large systems depend on
it, including UNIX(tm).

Authors of compilers, libraries, tools, and programming languages
have begun to at least consider foreign character sets.  ANSI C
is the current popular example.  While their trigraphs are just
a superfluous wart on a wart in my opinion, their concept of
"locales" is just the thing we all think of.  Your <ctype.h> for
you, mine for me.
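
For readers who have not seen the locale interface, here is a small sketch of how a program would ask for "its" <ctype.h>. setlocale() and LC_CTYPE come from <locale.h>; the two wrapper functions are hypothetical, and which locale names exist is entirely system dependent.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Select the character-classification rules named by `name`.
 * "" asks for whatever the user's environment specifies;
 * "C" is the minimal default every program starts in.
 * Returns 1 on success, 0 if that locale is not available. */
int use_ctype_locale(const char *name)
{
    return setlocale(LC_CTYPE, name) != NULL;
}

/* Uppercase one byte under whichever locale is active. */
unsigned char locale_upcase(unsigned char c)
{
    return (unsigned char)toupper(c);
}
```

In the "C" locale locale_upcase(0xE4) returns 0xE4 unchanged. After a successful switch to an ISO 8859/1 locale one would expect 0xC4 instead, without any change to the calling code.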

In the meantime, please re-read Torsten Olsson's article.  All he
asks for is to respect all bits in a character variable as
pertaining to the character code itself.  Don't mess around with
the eighth, ninth, or whatever bit of it, please.

We Europeans don't want to prohibit you Americans to write
programs using the ASCII code anymore, don't get me wrong.  If
we want to extend your programs for our needs we will probably
manage that as we did until now.  Don't break your head with
writing sorting routines for non-ASCII character sets.
Thank you.

							Martin

[PS: I have to admit: I am frightened of the day I have to write
programs respecting Japanese/Chinese character sets, too.  These
poor fellows have to suffer from ASCII most, I think.]

wnp@killer.UUCP (Wolf Paul) (06/29/88)

In article <920@infbs.UUCP> neitzel@infbs.UUCP (Martin Neitzel) writes:
 >(Here comes yet anotherone of those European bastards... :-)

Nun, so schlimm sind wir doch nicht!

 >torsten@DNA.LTH.Se (Torsten Olsson) writes:
 >TO>	US PC programmers still live in a 7-bit world!
 >TO>	[...]
 >In article <126@dcs.UUCP> wnp@dcs.UUCP (Wolf N. Paul) replied:
 >WNP>	[...]
 >WNP>	The C functions toupper()/tolower() rely on upper and lower case to be
 >WNP>	two parallel groups of consecutive codes within the ASCII scheme.
 >Basically Right.  But one of the reasons for those macros/functions
 >is to get independent of the used character set, or am I completely
 >wrong?  (I know, the UNIX(tm) <ctype.h> requires us to write something
 >like "if (isascii(c) && islower(c) ...", but that should be relatively
 >easy to port into an 8-bit environment.  On the other hand, if someone
 >thinks:  "Hey, 7 bits in my char for the ascii code, now let's see what
 >I can mess around with the 8th!" -- that's neither portable nor justified
 >by K&R.

I fully agree with you there. My point is that if we are going to use the
8-bit character set, and if, like IBM, we call it "extended ASCII",
let's extend it consistently. 

In standard ASCII, uppercase alphabetics are codes 65-90 (Hex 41-5A),
and lowercase alphas are codes 97-122 (Hex 61-7A). Thus, by adding 32
(Hex 20) to an uppercase character, I can convert it to lower case, and
by subtracting the same amount from a lower case character, I can convert
it to an uppercase character.

If IBM's character set were really "extended ASCII", this would work for
the non-English, 8-bit characters, as well. It doesn't ...
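
The arithmetic can be checked with a few lines of C. The function name is made up; the code values are those of standard ASCII and of the IBM PC chart (129 for lowercase u-umlaut, 154 for its capital).

```c
/* Shift a code down by 0x20, the ASCII lower-to-upper offset.
 * No check is made that c is actually a lowercase letter. */
unsigned char shift_to_upper(unsigned char c)
{
    return (unsigned char)(c - 0x20);
}
```

shift_to_upper('a') gives 65, which is 'A', as expected. shift_to_upper(129) gives 97, which is plain 'a', while the capital U-umlaut sits at 154.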

Of course, IBM is not alone to blame. When I worked with Apple // computers,
they would sell a computer in Germany or Austria which could be switched
between German and English, but if you moved that computer to France, or
even to Switzerland, you'd be stuck with some common characters which your
computer didn't produce. They used the eighth bit as a "flashing" attribute.

 > None of the following replies are intended to attack Wolf Paul
 > personally.  But some points should be made clear.

Don't feel attacked. I'm European myself.

 >
 >WNP>	The way	IBM has chosen to implement non-English characters on its
 >WNP>	PC line is
 >WNP>	(a) non-standard (i.e. applies only to IBM-compatible machines) and
 >
 >Then perhaps we can move to ISO Latin-X.  That would still confront us
 >with the same problems:  non-ASCII, 8-bit.  (For more on the standards
 >issue, see below.)

I'd be very interested if you or someone else could send me a table of
ISO Latin-X.

 >WNP>	(b) incompatible with the assumption about upper and lower case in ASCII
 >WNP>	    and thus in C and other programming languages.
 >
 >K&R did not made any assumptions about required character sets, except
 >there must be at least one positive element contained.  So far regarding
 >"C & ASCII".

But in practice most compiler writers assume ASCII, probably because 
computer manufacturers claim that their machines support ASCII -- albeit
"extended".

 >WNP>	(c) incompatible with the way European characters are implemented
 >WNP>	    on MOST printers and ANSI terminals.
 >
 >Perhaps I should explain what "this way" was: The ascii characters like
 >[]{}\~ were considered as "not so useful for Europeans" and their codes
 >were interpreted as national characters.  That was a HACK, not a
 >solution!  Yes, usage of those ascii charcaters and national characters
 >was mutually exclusive.  Not "both of them" in one document, listen?

I agree with that assessment.

 >[btw, that's one of the reasons why I write in English when programming,
 > and not in my native language (and the language the people that want to
 > read my programs here would like to read.)]

One reason I avoid my native language (same as yours) when dealing with
computers and programming is the unbelievable kludges which seem to have
entered it. I can't get myself talking or writing about "Fietschers" of
a system. (I picked that one out of "Chip" a couple of years ago).

 >Wolf is right: IBM did not constrain itself to any standard when
 >introducing their PCs.  But they made a first step into a reasonable
 >direction:
 >(1) Keeping 0-127 ASCII, and
 >(2) Providing most of the Europeans with "their" characters.
 >
 >They certainly broke no previous standard.  The previous practice
 >as explained above was not a standard.  As I said, it was a hack, and
 >nobody ever before did care about finding a solution.

But they missed the mark when they failed to group upper and lower case
characters together in a manner similar to standard ASCII.

 >WNP>	The way IBM implemented it, all case functions would have to be
 >WNO>	table-driven, which is much less elegant than working with the
 >WNP>	parallel ranges of characters in standard ASCII.
 >
 >Why is the table-driven approach "much less elegant"?  A syntax-table
 >ala GNU's is easy to implement, costs few memory, simplifies and speeds
 >up the code, offers classifications beyond "letter/upper/lower" in a
 >simple way, is easy to re-configure at run time. (I think, for an editor
 >a table-driven implementation is the only way to go.  But let's keep to
 >the 8-bit issue.  [Btw, the character classification problem is
 >one of Jon Bentleys primary examples for data-driven-programming
 >in his "Programming Pearls".])

Because of the harebrained way IBM assigned characters to the eight-bit
positions, you would actually need two tables -- one to go from upper
to lower case, and another vice-versa. THAT's inelegant.

 >The Intelligent Way will be some ascii extended to eigth bits. 
 >
 >It maybe hard to accept for Americans that ASCII is not
 >sufficient as an international code for information interchange
 >(and information processing).  A change has to be made and that
 >may hurt one or the other, resulting in emotionally heated
 >reactions on the American side.  At least ASCII has been proven
 >to be useful for some decades now, and large systems depend on
 >it, including UNIX(tm).
 >
 >Authors of compilers, libraries, tools, and programming languages
 >have begun to at least consider foreign character sets.  ANSI C
 >is the current popular example.  While their trigraphs are just
 >an superflous wart on a wart in my opinion, their concept of
 >"locales" is just the thing we all think of.  Your <ctype.h> for
 >you, mine for me.

I basically agree, but with today's terminals and printers, you
need a different ctype.h for your screen and for your printer.

 >
 >In the meantime, please re-read Torsten Olssons article.  All he
 >asks for is to respect all bits in a character variable as
 >pertaining to the character code itself.  Don't mess around with
 >the eigth, ninth, or whatever bit of it, please.

With reference to C compilers, I don't think there's any funny business
going on with the eighth bit. They just don't support the IBM non-English
characters as alphabetics, because they don't fit into the ASCII scheme,
even if one extended it to eight bits.

 >We Europeans don't want to prohibit you Americans to write
 >programs using the ASCII code anymore, don't get me wrong.  If
 >we want to extend your programs for our needs we will probably
 >manage that as we did until now.  Don't break your head with
 >writing sorting routines for non-ASCII character sets.

Now that would truly have to be localized on a per-language basis,
because different languages using the same character shapes don't
necessarily sort them in the same sequence. So we won't bother with
it over here -- I'll wait until I get back to Europe, and then see
which part I've landed in before I mess with sorting :-).
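
To show why one table cannot serve everybody, here is a rough sketch in C comparing two conventions for ISO 8859/1 text: Swedish, where a-ring, a-umlaut and o-umlaut are separate letters after z, and German dictionary order, where the umlauts sort with their base vowels. It handles lowercase only, ignores tie-breaking, and all names are hypothetical.

```c
/* Collation key for one ISO 8859/1 byte, Swedish convention.
 * Ordinary codes are scaled by 4 so that extra letters can be
 * slotted in right after 'z'. */
int key_swedish(unsigned char c)
{
    switch (c) {
    case 0xE5: return 'z' * 4 + 1;   /* a-ring   */
    case 0xE4: return 'z' * 4 + 2;   /* a-umlaut */
    case 0xF6: return 'z' * 4 + 3;   /* o-umlaut */
    default:   return c * 4;
    }
}

/* Same idea, German dictionary convention: umlauts sort as vowels. */
int key_german(unsigned char c)
{
    switch (c) {
    case 0xE4: return 'a' * 4;
    case 0xF6: return 'o' * 4;
    case 0xFC: return 'u' * 4;
    default:   return c * 4;
    }
}

typedef int (*key_fn)(unsigned char);

/* Compare two strings under the given key function.
 * Returns -1, 0 or 1 in the manner of strcmp(). */
int collate(const char *a, const char *b, key_fn key)
{
    while (*a != '\0' && *b != '\0') {
        int ka = key((unsigned char)*a);
        int kb = key((unsigned char)*b);
        if (ka != kb)
            return ka < kb ? -1 : 1;
        a++;
        b++;
    }
    /* A string that is a prefix of the other sorts first. */
    return (*a != '\0') - (*b != '\0');
}
```

Under key_swedish the word "\xE4r" sorts after "zon"; under key_german the same bytes sort before it.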

 >[PS: I have to admit: I am frightened of the day I have to write
 >programs respecting Japanese/Chinese character sets, too.  These
 >poor fellows have to suffer from ASCII most, I think.]

Well, that won't fit into 8 bits, anyway.
-- 
Wolf N. Paul * 3387 Sam Rayburn Run * Carrollton TX 75007 * (214) 306-9101
UUCP:   killer!dcs!wnp                    ESL: 62832882
DOMAIN: wnp@dcs.UUCP                      TLX: 910-380-0585 EES PLANO UD

karl@haddock.ISC.COM (Karl Heuer) (06/30/88)

This discussion belongs in comp.std.internat; I'm moving it there.

In article <4635@killer.UUCP> wnp@killer.UUCP (Wolf Paul) writes:
>In standard ASCII, uppercase alphabetics are codes 65-90 (Hex 41-5A),
>and lowercase alphas are codes 97-122 (Hex 61-7A). Thus, by adding 32
>(Hex 20) to an uppercase character, I can convert it to lower case, and
>by subtracting the same amount from a lower case character, i can convert
>it to an uppercase character.
>
>If IBM's character set were really "extended ASCII", this would work for
>the non-English, 8-bit characters, as well. It doesn't ...

This would probably be a good idea, simply on the grounds that (if there's no
other reason to prefer one ordering over another) it would make certain
operations faster.  But in any case, portable programs won't take advantage of
such coincidences, except through portable interfaces such as toupper().

What about characters like German eszett, which have no uppercase equivalent?
This has the property islower(c), but toupper(c) != (c-0x20) no matter how you
arrange the character set.  (In ANSI C, toupper(c) returns the argument
unchanged if it can't be uppercasified.)

>[neitzel@infbs.UUCP (Martin Neitzel) writes:]
>>[On most European printers and terminals] The ascii characters like
>>[]{}\~ were considered as "not so useful for Europeans" and their codes
>>were interpreted as national characters.  That was a HACK, not a
>>solution!  Yes, usage of those ascii charcaters and national characters
>>was mutually exclusive.  Not "both of them" in one document, listen?
>
>I agree with that assessment.

So do I.  But it does have the advantage that (a) it allows the +0x20 hack to
work, and (b) it yields a contiguous alphabet, so MIN_LETR <= c <= MAX_LETR
works (which is probably more often assumed true than the +0x20 idiom).

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

wcs@skep2.ATT.COM (Bill.Stewart.<ho95c>) (06/30/88)

In article <4635@killer.UUCP> wnp@killer.UUCP (Wolf Paul) writes:
: >WNP>	The way IBM implemented it, all case functions would have to be
: >WNP>	table-driven, which is much less elegant than working with the
: >WNP>	parallel ranges of characters in standard ASCII.
: >Why is the table-driven approach "much less elegant"?  A syntax-table
:Because of the harebrained way IBM assigned characters to the eight-bit
:positions, you would actually need two tables -- one to go from upper
:to lower case, and another vice-versa. THAT's inelegant.

But you have to do the equivalent now, in the sense of
	islower(c) ? toupper(c) : c 	vs.
	isupper(c) ? tolower(c) : c 
(System V uses a function call that does this)

When I'm doing this kind of work, I almost always use a table -
TOUPPER(c) or TOLOWER(c)
	- it always does the right thing, even if the character
		wasn't upper or lower case.  If you *must* have
		specific-case character results, it's easy to get.
	- it only takes one step, so it's efficient
	- it only references the character once, so macros are safe
	- it works on BSD and System V
	- it's extensible to non-ASCII character sets
	- if you have shared libraries it doesn't waste much space.
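
A sketch of how such TOUPPER/TOLOWER tables might be set up in C for the IBM PC set. The pair list is read off the code page 437 chart, and the table and function names are invented here. Both tables are filled from the one list, so there is only a single place to maintain.

```c
/* {lower, upper} pairs in the IBM PC set. Letters without a
 * partner have no entry and therefore map to themselves. */
static const unsigned char pc_pairs[][2] = {
    {129, 154}, {130, 144}, {132, 142}, {134, 143},
    {135, 128}, {145, 146}, {148, 153}, {164, 165}
};

static unsigned char to_upper_tab[256];
static unsigned char to_lower_tab[256];

/* Index through unsigned char so codes above 127 stay in range.
 * The argument is evaluated exactly once. */
#define TOUPPER(c) (to_upper_tab[(unsigned char)(c)])
#define TOLOWER(c) (to_lower_tab[(unsigned char)(c)])

/* Fill both tables; call once at startup. */
void case_tables_init(void)
{
    unsigned i;

    for (i = 0; i < 256; i++)           /* default: identity */
        to_upper_tab[i] = to_lower_tab[i] = (unsigned char)i;

    for (i = 'a'; i <= 'z'; i++) {      /* ASCII letters */
        to_upper_tab[i] = (unsigned char)(i - 0x20);
        to_lower_tab[i - 0x20] = (unsigned char)i;
    }

    for (i = 0; i < sizeof pc_pairs / sizeof pc_pairs[0]; i++) {
        to_upper_tab[pc_pairs[i][0]] = pc_pairs[i][1];
        to_lower_tab[pc_pairs[i][1]] = pc_pairs[i][0];
    }
}
```

After case_tables_init(), TOUPPER(132) is 142 and TOLOWER(154) is 129, while a digit or an unpaired letter such as 131 comes back unchanged.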
-- 
#				Thanks;
# Bill Stewart, AT&T Bell Labs 2G218, Holmdel NJ 1-201-949-0705 ihnp4!ho95c!wcs

frisk@rhi.hi.is (Fridrik Skulason) (07/01/88)

In article <920@infbs.UUCP> neitzel@infbs.UUCP (Martin Neitzel) writes:
>
>                                         On the other hand, if someone
>thinks:  "Hey, 7 bits in my char for the ascii code, now let's see what
>I can mess around with the 8th!" -- that's neither portable nor justified
>by K&R.
>

Agree!

Another related problem is:

    "Well - nobody uses the 8th bit anyway, so .....   c &= 0x7f;"

I have even run into a couple of C compilers that assume this, MEGAMAX C
for the ATARI ST, and the C compiler for the Archimedes. This of course
means that I advise students here at the university NOT to buy the latter
machine.
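
What the mask actually does to a national character is easy to show. In this sketch (function names are illustrative), IBM PC code 132 is 0x84, and 0x84 & 0x7f leaves 0x04, a control character.

```c
#include <stddef.h>

/* The offending operation, isolated. */
unsigned char strip_high_bit(unsigned char c)
{
    return (unsigned char)(c & 0x7f);
}

/* Count the bytes in buf[0..n) that stripping would change. */
size_t count_damaged(const unsigned char *buf, size_t n)
{
    size_t i;
    size_t damaged = 0;

    for (i = 0; i < n; i++)
        if (strip_high_bit(buf[i]) != buf[i])
            damaged++;
    return damaged;
}
```

Pure ASCII text reports zero damaged bytes, which is presumably why the problem goes unnoticed by people who never type anything else.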

Sometimes one cannot ignore the problem like this. For example, I had
to patch PC/NFS some days ago, since the terminal emulator stripped
off the 8th bit. (And while doing it I added automatic translation from
the PC character set to/from ISO 8859/1)
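
A translation of that kind can be sketched with a pair of 256-entry tables. This is not the PC/NFS patch itself, only an illustration: it covers a handful of letters whose positions in both sets are well known, and mapping everything else above 127 to '?' is an arbitrary choice made for the example.

```c
/* {IBM PC code, ISO 8859/1 code} for a few letters. */
static const unsigned char pc_iso_pairs[][2] = {
    {0x81, 0xFC},   /* u-umlaut */
    {0x82, 0xE9},   /* e-acute  */
    {0x84, 0xE4},   /* a-umlaut */
    {0x86, 0xE5},   /* a-ring   */
    {0x8E, 0xC4},   /* A-umlaut */
    {0x8F, 0xC5},   /* A-ring   */
    {0x90, 0xC9},   /* E-acute  */
    {0x94, 0xF6},   /* o-umlaut */
    {0x99, 0xD6},   /* O-umlaut */
    {0x9A, 0xDC}    /* U-umlaut */
};

static unsigned char pc_to_iso[256];
static unsigned char iso_to_pc[256];

/* Build both directions from the one pair list; call once. */
void xlate_init(void)
{
    unsigned i;

    /* ASCII maps to itself; unknown high codes become '?'. */
    for (i = 0; i < 256; i++)
        pc_to_iso[i] = iso_to_pc[i] = (unsigned char)(i < 128 ? i : '?');

    for (i = 0; i < sizeof pc_iso_pairs / sizeof pc_iso_pairs[0]; i++) {
        pc_to_iso[pc_iso_pairs[i][0]] = pc_iso_pairs[i][1];
        iso_to_pc[pc_iso_pairs[i][1]] = pc_iso_pairs[i][0];
    }
}

unsigned char pc2iso(unsigned char c) { return pc_to_iso[c]; }
unsigned char iso2pc(unsigned char c) { return iso_to_pc[c]; }
```

A terminal emulator would run incoming bytes through iso2pc() before display and outgoing keystrokes through pc2iso(), provided nothing upstream has already stripped the 8th bit.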

Not so serious maybe, but still boring, are programs that produce a warning
message if any character has the 8th bit set. (PC-Kermit for example)

>
>WNP>	(c) incompatible with the way European characters are implemented
>WNP>	    on MOST printers and ANSI terminals.
>
>Perhaps I should explain what "this way" was: The ascii characters like
>[]{}\~ were considered as "not so useful for Europeans" and their codes
>were interpreted as national characters.

Here in Iceland this used to be true until a few years ago. Then it was
decided to standardize on ISO 8859/1 (or "ECMA" code, as it was known then)

Now we are using ISO 8859/1 on our VAXes (instead of DEC multinational)
and on our HPs (instead of Roman8). Even the ATARI ST computer uses the
ISO standard (instead of a not-quite-IBM-PC-compatible character set with
Hebrew extensions)

So, now we can use [,],{,},|,\,@,^,~ and ` together with our national
characters. The same goes in all other European countries - we are
moving away from modified 7 bit ASCII to standardized 8 bit character
sets. 

>
>Wolf is right: IBM did not constrain itself to any standard when
>introducing their PCs.  But they made a first step into a reasonable
>direction:
>(1) Keeping 0-127 ASCII, and
>(2) Providing most of the Europeans with "their" characters.

Most of them, yes, but not all. Not, for example, all the Icelandic,
Danish, Norwegian or Portuguese characters. They corrected that on the
PS/2 series, by providing the CP-850 character set, which includes all the
printable characters in ISO 8859/1 (Latin-1).

Unfortunately, they are not in the same positions as in the ISO standard.

>
>WNP>	The way IBM implemented it, all case functions would have to be
>WNO>	table-driven, which is much less elegant than working with the
>WNP>	parallel ranges of characters in standard ASCII.
>

So let's just use a character set like ISO 8859/1, where the European
special characters are also (mostly) in parallel ranges.

One problem with parallel ranges is that some characters may only
exist as lower case, like the German eszett (position DF in ISO 8859/1).
(The character in position FF is y with two dots above, which is normally
not used in upper case.)
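
For ISO 8859/1 the "mostly parallel" layout means a case conversion needs only a range test plus a few exceptions. A sketch in C, with a made-up function name:

```c
/* Uppercase one ISO 8859/1 byte.
 * Codes 0xE0..0xFE are lowercase letters whose capitals sit 0x20
 * lower, except 0xF7, which is the division sign.
 * 0xDF (sharp s) and 0xFF (y with diaeresis) have no capital in
 * this set and are returned unchanged. */
unsigned char latin1_toupper(unsigned char c)
{
    if (c >= 'a' && c <= 'z')
        return (unsigned char)(c - 0x20);
    if (c >= 0xE0 && c <= 0xFE && c != 0xF7)
        return (unsigned char)(c - 0x20);
    return c;
}
```

So latin1_toupper(0xE4) yields 0xC4, while 0xDF, 0xF7 and 0xFF come back as they went in.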

>
>WNP>	So all of you Europeans should lobby hardware manufacturers to
>WNP>	implement foreign characters in an intelligent way, and in a
>WNP>	STANDARD WAY across different architectures, and THEN you can
>WNP>	reasonably expect the authors of compilers and libraries and
>WNP>	tools to support these characters. 
>
>The Intelligent Way will be some ascii extended to eigth bits. 
>

Not just "some ascii" - ISO 8859/1 (or /2 /3 /4)

>
>Authors of compilers, libraries, tools, and programming languages
>have begun to at least consider foreign character sets.  ANSI C
>is the current popular example.  While their trigraphs are just
>an superflous wart on a wart in my opinion, their concept of
>"locales" is just the thing we all think of.  Your <ctype.h> for
>you, mine for me.

The discussion about trigraphs in comp.lang.c some time ago looked
somewhat silly to me, since some people were arguing that we (Europeans)
needed those, where in fact we do not. After all, who cares if 

             #define

looks like

             <pound sign>define

if it functions equally.

-- 
         Fridrik Skulason          University of Iceland
         UUCP  frisk@rhi.uucp      BIX  frisk

     This line intentionally left blank ...................

pinard@odyssee.UUCP (Francois Pinard) (07/01/88)

In article <699@omen.UUCP>, caf@omen.UUCP (Chuck Forsberg WA7KGX) writes:
> In article <1988Jun22.223158.1366@LTH.Se> torsten@DNA.LTH.Se (Torsten Olsson) writes:
> :US PC programmers still live in a 7-bit world!
> :Yes, there  i s  intelligent life outside the USA.
> The major US PSVANS (Telenet, Tymnet, CIS, etc.) all support 8 bits!

We sometimes have the feeling that 8-bit support is mainly for binaries.

(postnews requires me
to give more new text,
even if useless to do
so...)
-- 
Francois Pinard    pinard@odyssee.uucp                       (514) 279-0716
`Vivement GNU!'    pinard%odyssee@larry.mcrcim.mcgill.edu    (514) 588-4656

wnp@dcs.UUCP (Wolf N. Paul) (07/02/88)

In article <1288@odyssee.UUCP> pinard@odyssee.UUCP (Francois Pinard) writes:
 >(postnews requires me
 >to give more new text,
 >even if useless to do
 >so...)

I am posting this because Francois is not the only one who needs to be
reminded of this.

Actually, if you read the stuff in news.announce.newusers, you are not supposed
to give more new text, but if your reply must be shorter than the quoted text
(i.e. if you have to include more quoted text than the length of your own 
reply), you should "hide" the old text (i.e. fool postnews) by either changing
the quote character from '>' to something else, or else, putting a space in 
front of each quote character as I have done above.

The purpose of the "more new text than included quote" is to REDUCE VOLUME;
to add garbage new text has exactly the opposite effect. Use common sense,
folks!
-- 
Wolf N. Paul * 3387 Sam Rayburn Run * Carrollton TX 75007 * (214) 306-9101
UUCP:     killer!dcs!wnp                 ESL: 62832882
DOMAIN:   wnp@dcs.UUCP                   TLX: 910-380-0585 EES PLANO UD

nwd@j.cc.purdue.edu (Daniel Lawrence) (07/02/88)

	There has been a lot of talk about problems with 8 bit
characters in various programs and editors.  I had many comments about
this problem in the last couple of years about MicroEMACS 3.9 (although
it was 3.6 and 3.7 at the time) and I fixed it to handle 8 bit
characters transparently quite some time ago.  So, entering high byte
characters from the keyboard using the alternate (Alt) key works properly.
Also keys can be set up like the following:

20	store-macro
	insert-string &chr 130
!endm
bind-to-key execute-macro-20 FN^R

	This causes an e with an acute accent to be inserted when the
alt-e combination is struck.

	My problem now is this.... How do I determine how such a
character should be treated when converted to uppercase?  Not knowing
the different languages involved, could someone in the know post such
info?  Does it vary from language to language?  How are characters like
this treated on UNIX machines of different sorts?  Rather than bemoaning
US programmers' lack of attention, could someone tell us what is
the right way to handle things like this?

			Daniel Lawrence		(317) 742-5153
			UUCP:	{ihnp4!pur-ee!}j.cc.purdue.edu!nwd
			ARPA:	nwd@j.cc.purdue.edu
			FIDO:	1:201/2 The Programmer's Room (317) 742-5533

madd@bu-cs.BU.EDU (Jim Frost) (07/03/88)

In article <126@dcs.UUCP> wnp@dcs.UUCP (Wolf N. Paul) writes:
|The C functions toupper()/tolower() rely on upper and lower case to be
|two parallel groups of consecutive codes within the ASCII scheme.

Keep in mind that the implementations of isupper() and islower() often
are table driven and do not make this assumption.  Also, toupper() and
tolower() are sometimes as simple as:

#define toupper(A) ((A) - ('a' - 'A'))

and other times are complex enough to check for upper/lowercaseness.
I'd just implement my own that handled the extra characters.  Who says
that you have to use the routines supplied with your compiler?
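
The difference between the two styles shows up as soon as the argument is not a lowercase letter. A short sketch, with both names invented so as not to collide with the library's own toupper():

```c
/* The unchecked form: correct only when the argument is already
 * known to be a lowercase ASCII letter. */
#define NAIVE_TOUPPER(c) ((c) - ('a' - 'A'))

/* A checked replacement that leaves everything else alone.
 * The range test assumes a-z are contiguous, as in ASCII. */
int my_toupper(int c)
{
    return (c >= 'a' && c <= 'z') ? c - ('a' - 'A') : c;
}
```

NAIVE_TOUPPER('A') produces 33, which is '!', whereas my_toupper('A') returns 'A'. Extending my_toupper() with a table for the extra characters is then a local change.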

jim frost
madd@bu-it.bu.edu

bcw@rti.UUCP (Bruce Wright) (07/04/88)

In article <4581@killer.UUCP>, bobc@killer.UUCP (Bob Calbridge) writes:
> In article <1988Jun22.223158.1366@LTH.Se>, newsuser@LTH.Se (Lund Institute of Technology news server) writes:
> 
> > Why? Well, we are sure the intelligent reader already grasps the
> > reason. Take a look at the IBM PC character code set  a b o v e
> > ASCII 127. Our alphabet is there, too, and you just can't imagine
> > what funny results your tools yield when encountering them.
> > So, if your pet program is to become our pet, too, you have
> > to rethink concerning using the 8th bit as a flag, you have
> > to rewrite toupper, tolower, word scan, delete word, word
> > counters and the like.
> 
> :-) But then that's why, as you included in your missive, it's
> called ASCII.  The A stands for American.  Is there
> such a thing as ISCII??? 
> 
> I can just imagine the loops and whirls you would have to
> go through to write a viable sort routine to include new
> alpha sorts.  And who's to determine what the order of sort
> would be.  I'll see ya at the international conference where
> we can hash this out.  :-]
> 
I realize that Bob was just trying to be funny, but even if you live in
the US you may want the international characters.  If you are trying
to correspond with people outside the English-speaking world (or sometimes
even within the English-speaking world), it can be >>EXTREMELY<< useful
to have the true characters rather than just their approximation in ASCII.
This is especially true if you are using letters (you know, paper and ink
and all that) rather than the network - your recipient will not appreciate
misspellings which are "unavoidable" on your text editor.

It's fairly easy to write a table to convert lower case to upper case
for any given character set.  Unfortunately the IBM set does not correspond 
to the ISO character set nor to the DEC multinational character set - this 
inhibits the portability of both files and programs.  

It would also be nice if the IBM-PC were set up to make >>GENERATING<< the
characters easier -- the DEC terminals and PC's are much nicer in this
respect.  It is much easier to remember "Compose-e-'" than it is to remember
"Alt-1-3-0"  (not to mention fewer keystrokes).  Unfortunately even on
the DEC PC's most editors don't deal with the multinational set well (though
they do on some of the larger DEC systems).

You do have a real problem with sorts though - even though any GIVEN collating
sequence is easy (just a table lookup), you will find that there is no
UNIVERSAL collating sequence for all the "special" characters.  For example,
Spanish considers "Ll" a special character sequence and collates it separately
from the English sequence "Ll".  I suspect that you would have to have a
number of translation tables.  But that would be a relatively minor problem -
at least for our purposes, we are more interested in generating text for
printing than we are in sorting anything.
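
As a rough sketch of the "just a table lookup" case, here is one such
table for Swedish, where a-ring, a-umlaut and o-umlaut sort after z in
that order.  It assumes the IBM PC code set; the weight values and the
names weight, init_swedish_weights and swedish_cmp are invented for
illustration:

```c
/* Collation weight for every character code. */
static int weight[256];
static int weight_ready = 0;

static void init_swedish_weights(void)
{
    /* IBM PC codes as {lower, upper}, in Swedish alphabetical order. */
    static const unsigned char extra[3][2] = {
        {134, 143},   /* a with ring     */
        {132, 142},   /* a with diaeresis */
        {148, 153}    /* o with diaeresis */
    };
    int c, i;

    /* Non-letters keep their code order and sort before all letters. */
    for (c = 0; c < 256; c++)
        weight[c] = c;
    /* a-z and A-Z share weights, so the comparison ignores case. */
    for (c = 0; c < 26; c++) {
        weight['a' + c] = 1000 + c;
        weight['A' + c] = 1000 + c;
    }
    /* The three extra letters follow z. */
    for (i = 0; i < 3; i++) {
        weight[extra[i][0]] = 1026 + i;
        weight[extra[i][1]] = 1026 + i;
    }
    weight_ready = 1;
}

/* Compare two strings by table weight; result has the sign strcmp uses. */
int swedish_cmp(const char *a, const char *b)
{
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;

    if (!weight_ready)
        init_swedish_weights();
    while (*p != 0 && weight[*p] == weight[*q]) {
        p++;
        q++;
    }
    return weight[*p] - weight[*q];
}
```

A one-weight-per-character table like this cannot express the Spanish
"Ll" case, since that needs to look at two characters at once; each
language would get its own table, and some would need more than a table.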

						Bruce C. Wright

daveb@geac.UUCP (David Collier-Brown) (07/04/88)

In article <7350@j.cc.purdue.edu> nwd@j.cc.purdue.edu.UUCP (Daniel Lawrence) writes:
[...]
>	This causes an e with an acute accent to be inserted when the
>alt-e combination is struck.
>
>	My problem now is this.... How do I determine how such a
>character should be treated when converted to uppercase.  Not knowing
>the different languages involved, could someone in the know post such
>info?  Does it vary from language to language?  

  Yes, it does change.
  For example, French accented characters in France have their
accents dropped when converted to uppercase.  French accented
characters in Canada retain their accents when converted to
uppercase.
  This poses problems for a unilingual Anglo...

>How are characters like
>this treated on UNIX machines of different sorts.  Rather than bemoaning
>the USA's programmers lack of attention, could someone tell us what is
>the right way to handle things like this?

  Do a literature search, once you become aware there is a problem.
[Programmers usually get shortchanged in university, by the way, and
often do not hear about literature searches until they hit grad
school, where the professors **freak** to discover that the student
doesn't have this basic skill. If you've been shortchanged, descend
on your local library's reference desk and have them point you at
their holdings on the subject --dave]

--dave (that's funny, there's no-one in the office today) c-b
-- 
 David Collier-Brown.  {mnetor yunexus utgpu}!geac!daveb
 Geac Computers Ltd.,  | "His Majesty made you a major 
 350 Steelcase Road,   |  because he believed you would 
 Markham, Ontario.     |  know when not to obey his orders"

billp@infmx.UUCP (Bill Potter) (07/05/88)

In all this discussion about international standards for 7-bit / 8-bit
characters I've missed any reference to the X-OPEN group's Native
Language Support specification. It seems to me that everything anyone
could want in terms of internationalisation is covered in this spec,
and as it will now, hopefully, be implemented by the major European
computer manufacturers and thus become a standard in Europe, it is
only a matter of time before the American computer manufacturers have to
follow. (That is assuming they are still interested in selling to Europe.)

tneff@dasys1.UUCP (Tom Neff) (07/06/88)

In article <1988Jun22.223158.1366@LTH.Se> torsten@DNA.LTH.Se (Torsten Olsson) bellyaches:
>US PC programmers still live in a 7-bit world!
>
>But we don't!

Not one but two meaningless generalizations.  US PC programmers live in
a multitude of "worlds," 7-, 8-bit and otherwise.

>All right, you get a lot of credit for producing lovely programs
>like uEmacs, Picnix, Tcless, and the like. 

This is what passes for PC programming in Europe?  Unix ports and
clones?  Take a look at BRIEF, PC-WRITE, MKS Toolkit and LIST.COM
sometime.  Blame Unix for 7-bit parochialism if you want, not PC
programmers.

> [then lists some characters]

>So, if your pet program is to become our pet, too, you have
>to rethink concerning using the 8th bit as a flag, you have
>to rewrite toupper, tolower, word scan, delete word, word
>counters and the like.

On the other hand, if you don't give a damn whether your program makes
the grade in Sweden or not, you can stick with what you're doing, or
publish the source and let Swedish programmers keep Swedish users happy
while we do the same here.  Just a thought.  :-)

In the meantime, if you have a complaint about the way some specific
programs behave, write the authors.  Thanks for the reminder about the
international character set, but I imagine most people (short of the
Unix clonesters, whom you seem to rely on unduly) either give it to you
for free (per the examples above) or have good reasons to abandon it in
favor of a performance tradeoff.

-- 
Tom Neff			UUCP: ...!cmcl2!phri!dasys1!tneff
	"None of your toys	CIS: 76556,2536	       MCI: TNEFF
	 will function..."	GEnie: TOMNEFF	       BIX: t.neff (no kidding)

ray@micomvax.UUCP (Ray Dunn) (07/06/88)

In article <7350@j.cc.purdue.edu> nwd@j.cc.purdue.edu.UUCP (Daniel Lawrence) writes:
 >.... Also keys can be set up like the following:
 >
 >store-macro-21
 >	insert-string &chr 130
 >!endm
 >bind-to-key execute-macro-20 FN^R
 >
 >	This causes an e with an acute accent to be inserted when the
 >alt-e combination is struck.
 >
 >	My problem now is this.... How do I determine how such a
 >character should be treated when converted to uppercase.  Not knowing
 >the different languages involved....

You're really answering the question yourself Daniel.  You must allow the
*user* to be able to specify the uc/lc and lc/uc relationships.

To be *fully* general, no assumptions about the reversibility of the
relationships should be made.

I won't suggest the syntax to you, 'cos although a user of Gosling Emacs on
UNIX, I prefer <another editor> on the PC, and I'm sure you can come up with
something fully user definable.
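
For what it's worth, the underlying data structure could be as simple
as two independent tables, sketched below.  All the names are
hypothetical, and this says nothing about how an editor would expose it
to the user; the point is only that the upper and lower mappings are
set separately, so neither has to be the inverse of the other:

```c
/* Two independent mappings; nothing forces them to be inverses. */
static unsigned char upper_of[256];
static unsigned char lower_of[256];

/* Start from identity plus the plain a-z / A-Z pairs (ASCII assumed). */
void case_reset(void)
{
    int c;

    for (c = 0; c < 256; c++) {
        upper_of[c] = (unsigned char)c;
        lower_of[c] = (unsigned char)c;
    }
    for (c = 'a'; c <= 'z'; c++) {
        upper_of[c] = (unsigned char)(c - ('a' - 'A'));
        lower_of[c - ('a' - 'A')] = (unsigned char)c;
    }
}

/* User-supplied definitions: each call changes one direction only. */
void define_upper(int from, int to)
{
    upper_of[(unsigned char)from] = (unsigned char)to;
}

void define_lower(int from, int to)
{
    lower_of[(unsigned char)from] = (unsigned char)to;
}

int user_toupper(int c) { return upper_of[(unsigned char)c]; }
int user_tolower(int c) { return lower_of[(unsigned char)c]; }
```

Following the convention reported elsewhere in this thread for France,
a user could call define_upper(130, 'E') so that e-acute uppercases to
a plain E.  Lowercasing that E then gives a plain e, not code 130, which
is exactly the non-reversible case.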
-- 
Ray Dunn.                      |   UUCP: ..!{philabs, mnetor}!micomvax!ray
Philips Electronics Ltd.       |   TEL : (514) 744-8200   Ext: 2347
600 Dr Frederik Philips Blvd   |   FAX : (514) 744-6455
St Laurent. Quebec.  H4M 2S9   |   TLX : 05-824090

ddb@ns.ns.com (David Dyer-Bennet) (07/08/88)

In article <133@dcs.UUCP>, wnp@dcs.UUCP (Wolf N. Paul) writes:
> The purpose of the "more new text than included quote" is to REDUCE VOLUME;
> to add garbage new text has exactly the opposite effect. Use common sense,
> folks!
  Common sense would dictate not attempting to impose a stupid "new text
ratio" requirement through the news software.  While it is possible to
"fool" that software, I have chosen instead to make my protest visible
on the occasions when I've been saddled with it. 

-- 
                  -- David Dyer-Bennet
		     ddb@viper.Lynx.MN.Org, ...{amdahl,hpda}!bungia!viper!ddb
		     Fidonet 1:282/341.0, (612) 721-8967 hst/2400/1200/300

cjh@hpausla.HP.COM (Clifford Heath) (07/18/88)

> characters I've missed any reference to the X-OPEN groups Native
....
> only a matter of time before the American computer manufacturers have to
> follow .

Americans (like anyone) can be pretty parochial, but don't forget that
X-OPEN designed its standard around **Hewlett-Packard's** implementation
of Native Language Support, which was a working system six years ago.

X-OPEN is a good standard as far as it goes, but when you try to build a
DBMS that uses it, you run into problems that are (mostly) not present
in HP's implementation, mainly due to the lack of a distinction between
data language and user language, and an inadequate definition of how
such multi-lingual systems can operate.  Remember that when you sort
something, it only appears sorted as long as you look at it using the
same collating sequence.  The Btree indices used in most databases
become corrupt if you update them using an incorrect collating sequence,
so you MUST distinguish between user language (for messages etc) and
data language.
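
A small sketch of the idea, not taken from HP's or X-OPEN's actual
interfaces: the index carries its own collating function (the "data
language"), and every lookup goes through it no matter what the user's
language is.  The two orderings and all names here are stand-ins
invented for the example:

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef int (*collate_fn)(const char *, const char *);

/* Collating sequence 1: plain byte order. */
int byte_order(const char *a, const char *b)
{
    return strcmp(a, b);
}

/* Collating sequence 2: ASCII letters compared without regard to case. */
int fold_order(const char *a, const char *b)
{
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;

    while (*p != 0 && tolower(*p) == tolower(*q)) {
        p++;
        q++;
    }
    return tolower(*p) - tolower(*q);
}

/* An index remembers the sequence its keys were sorted under. */
struct index {
    const char **keys;
    size_t       nkeys;
    collate_fn   data_order;
};

/* Returns 1 if keys look sorted when viewed through `order`, else 0. */
int appears_sorted(const char **keys, size_t n, collate_fn order)
{
    size_t i;

    for (i = 1; i < n; i++)
        if (order(keys[i - 1], keys[i]) > 0)
            return 0;
    return 1;
}

/* Binary search using the index's own sequence; -1 if key is absent. */
long index_find(const struct index *ix, const char *key)
{
    size_t lo = 0, hi = ix->nkeys;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int c = ix->data_order(key, ix->keys[mid]);

        if (c == 0)
            return (long)mid;
        if (c < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return -1;
}
```

Keys such as "apple", "Banana", "cherry", "Date" appear sorted under
fold_order but not under byte_order, so searching or updating them with
the wrong function is what breaks the index.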

In any case, for anyone wanting to sell software outside their own
country, X-OPEN should be compulsory reading.

Clifford Heath, Hewlett Packard Australian Software Operation.
(UUCP: hplabs!hpfcla!hpausla!cjh, ACSnet: cjh@hpausla.oz)
I didn't get paid to say this, so if you don't like it, don't tell HP.
Don't tell me, either.