newsuser@LTH.Se (Lund Institute of Technology news server) (06/23/88)
US PC programmers still live in a 7-bit world!

But we don't!
_____________

Yes, there i s intelligent life outside the USA. We even live
in an 8-bit world, which must come as a shocking piece of news
to some of you.

All right, you get a lot of credit for producing lovely programs
like uEmacs, Picnix, Tcless, and the like. They are lovely, yes,
but just to a certain extent, because they are completely useless
in Europe!

Why? Well, we are sure the intelligent reader already grasps the
reason. Take a look at the IBM PC character code set a b o v e
ASCII 127. Our alphabet is there, too, and you just can't imagine
what funny results your tools yield when encountering them.
For instance, these are letters:

   Lower case  Upper case  |  Lower case  Upper case
      129         154      |     145         146
      130          -       |     147          -
      131          -       |     148         153
      132         142      |     149          -
      133          -       |     150          -
      134         143      |     151          -
      135         128      |     152          -
      136          -       |     160          -
      137          -       |     161          -
      138          -       |     162          -
      139          -       |     163          -
      140          -       |     164         165
      141          -       |

So, if your pet program is to become our pet, too, you have
to rethink concerning using the 8th bit as a flag, you have
to rewrite toupper, tolower, word scan, delete word, word
counters and the like.

Then you, as well, will discover the joys in the real, 8-bit
world! Have fun.

--
Torsten Olsson, Dept of Comp Sc, Lund University, Box 118, S221 00 Lund, Sweden
Phone: +46-46104930 (work), +46-46126768 (home)
Bitnet: lthlib@seldc52
Internet: torsten@dna.lth.se or torsten%dna.lth.se@uunet.uu.net
UUCP: {uunet,mcvax}!enea!dna.lth.se!torsten
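To make the "word scan" and "word counter" complaint concrete, here is a rough C sketch of a word counter that treats the codes from the table above as letters. The names pc_isletter and pc_count_words are invented for this illustration, and the letter ranges are taken only from the table in this post (code 144, for instance, is not listed there and is therefore left out), so this is an assumption-laden sketch rather than a complete description of the IBM PC set.

```c
#include <stddef.h>

/* Hypothetical helper: nonzero if c is an ASCII letter or one of the
   IBM PC codes that the table in the post lists as a letter.
   Listed codes collapse into the ranges 128-143, 145-154 and 160-165. */
int pc_isletter(unsigned char c)
{
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
        return 1;
    if (c >= 128 && c <= 143)
        return 1;
    if (c >= 145 && c <= 154)
        return 1;
    if (c >= 160 && c <= 165)
        return 1;
    return 0;
}

/* Hypothetical word counter: a word is a maximal run of letters.
   Every byte is converted to unsigned char, so all 8 bits are treated
   as part of the character code and none is used as a flag. */
size_t pc_count_words(const char *s)
{
    size_t words = 0;
    int in_word = 0;

    for (; *s != '\0'; s++) {
        int letter = pc_isletter((unsigned char)*s);
        if (letter && !in_word)
            words++;
        in_word = letter;
    }
    return words;
}
```

With this, a string such as "f\201r alle" (code 129 in the middle of the first word) counts as two words. A counter that only knows A-Z and a-z would split the first word at the 8-bit character and report three.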
bobc@killer.UUCP (Bob Calbridge) (06/25/88)
In article <1988Jun22.223158.1366@LTH.Se>, newsuser@LTH.Se (Lund Institute of Technology news server) writes:
> Why? Well, we are sure the intelligent reader already grasps the
> reason. Take a look at the IBM PC character code set a b o v e
> ASCII 127. Our alphabet is there, too, and you just can't imagine
> what funny results your tools yield when encountering them.
> So, if your pet program is to become our pet, too, you have
> to rethink concerning using the 8th bit as a flag, you have
> to rewrite toupper, tolower, word scan, delete word, word
> counters and the like.

:-) But then that's why, as you included in your missive, it's
called ASCII. The A stands for American. Is there such a
thing as ISCII???

I can just imagine the loops and whirls you would have to
go through to write a viable sort routine to include new
alpha sorts. And who's to determine what the order of sort
would be? I'll see ya at the international conference where
we can hash this out. :-]

Just for fun.
Bob
wnp@dcs.UUCP (Wolf N. Paul) (06/25/88)
In article <1988Jun22.223158.1366@LTH.Se> torsten@DNA.LTH.Se (Torsten Olsson) writes:
>US PC programmers still live in a 7-bit world!
>
>But we don't!
>
>Yes, there i s intelligent life outside the USA. We even live
>in an 8-bit world, which must come as a shocking piece of news
>to some of you.
>Take a look at the IBM PC character code set a b o v e
>ASCII 127. Our alphabet is there, too, and you just can't imagine
>what funny results your tools yield when encountering them.
>
>So, if your pet program is to become our pet, too, you have
>to rethink concerning using the 8th bit as a flag, you have
>to rewrite toupper, tolower, word scan, delete word, word
>counters and the like.

As a native German-speaker I am aware of the problem Torsten refers to.
However, it is a bit more complex than his posting implies.

The C functions toupper()/tolower() rely on upper and lower case to be
two parallel groups of consecutive codes within the ASCII scheme.

The way IBM has chosen to implement non-English characters on its
PC line is

   (a) non-standard (i.e. applies only to IBM-compatible machines),
   (b) incompatible with the assumption about upper and lower case in ASCII
       and thus in C and other programming languages, and
   (c) incompatible with the way European characters are implemented
       in MOST printers and ANSI terminals.

The way IBM implemented it, all case functions would have to be
table-driven, which is much less elegant than working with the
parallel ranges of characters in standard ASCII.

So all of you Europeans should lobby hardware manufacturers to
implement foreign characters in an intelligent way, and in a
STANDARD WAY across different architectures, and THEN you can
reasonably expect the authors of compilers and libraries and
tools to support these characters.
--
Wolf N. Paul * 3387 Sam Rayburn Run * Carrollton TX 75007 * (214) 306-9101
UUCP: killer!dcs!wnp ESL: 62832882
DOMAIN: wnp@dcs.UUCP TLX: 910-380-0585 EES PLANO UD
caf@omen.UUCP (Chuck Forsberg WA7KGX) (06/26/88)
In article <1988Jun22.223158.1366@LTH.Se> torsten@DNA.LTH.Se (Torsten Olsson) writes:
:US PC programmers still live in a 7-bit world!
:
:But we don't!
:_____________
:
:
:Yes, there i s intelligent life outside the USA. We even live
:in an 8-bit world, which must come as a shocking piece of news
:to some of you.
Very strange, since the requests for ZMODEM extensions to deal with
7 bit transmission paths come mostly from European PSVAN users. The
major US PSVANS (Telenet, Tymnet, CIS, etc.) all support 8 bits!
Chuck Forsberg WA7KGX ...!tektronix!reed!omen!caf
Author of YMODEM, ZMODEM, Professional-YAM, ZCOMM, and DSZ
Omen Technology Inc "The High Reliability Software"
17505-V NW Sauvie IS RD Portland OR 97231 503-621-3406
TeleGodzilla BBS: 621-3746 CIS: 70007,2304 Genie: CAF
P.S.: ZCOMM and Professional-YAM do support 8 bit character sets.
tj@gpu.utcs.toronto.edu (Terry Jones) (06/27/88)
There is a difference between communicating and character sets. Read the whole message before posting a reply like that.
neitzel@infbs.UUCP (Martin Neitzel) (06/28/88)
(Here comes yet another one of those European bastards... :-)

torsten@DNA.LTH.Se (Torsten Olsson) writes:
TO> US PC programmers still live in a 7-bit world!
TO> [...]

In article <126@dcs.UUCP> wnp@dcs.UUCP (Wolf N. Paul) replied:
WNP> [...]
WNP> The C functions toupper()/tolower() rely on upper and lower case to be
WNP> two parallel groups of consecutive codes within the ASCII scheme.

Basically right. But one of the reasons for those macros/functions
is to become independent of the character set in use, or am I completely
wrong? (I know, the UNIX(tm) <ctype.h> requires us to write something
like "if (isascii(c) && islower(c)) ...", but that should be relatively
easy to port to an 8-bit environment.) On the other hand, if someone
thinks: "Hey, 7 bits in my char for the ASCII code, now let's see what
I can mess around with the 8th!" -- that's neither portable nor justified
by K&R.

[This is the point where I should mention that discussions about
our European character sets are usually somewhat emotionally heated
(probably because we always had to kind of "suffer" from ASCII here
in Europe). None of the following replies are intended to attack
Wolf Paul personally. But some points should be made clear.]

WNP> The way IBM has chosen to implement non-English characters on its
WNP> PC line is
WNP> (a) non-standard (i.e. applies only to IBM-compatible machines) and

Then perhaps we can move to ISO Latin-X. That would still confront us
with the same problems: non-ASCII, 8-bit. (For more on the standards
issue, see below.)

WNP> (b) incompatible with the assumption about upper and lower case in ASCII
WNP> and thus in C and other programming languages.

K&R did not make any assumptions about required character sets, except
that there must be at least one positive element contained. So much for
"C & ASCII".

WNP> (c) incompatible with the way European characters are implemented
WNP> on MOST printers and ANSI terminals.
Perhaps I should explain what "this way" was: the ASCII characters like
[]{}\~ were considered "not so useful for Europeans" and their codes
were interpreted as national characters. That was a HACK, not a
solution! Yes, usage of those ASCII characters and national characters
was mutually exclusive. Not "both of them" in one document, listen?

[btw, that's one of the reasons why I write in English when programming,
and not in my native language (the language that the people who want to
read my programs here would like to read).]

Wolf is right: IBM did not constrain itself to any standard when
introducing their PCs. But they made a first step in a reasonable
direction:
(1) Keeping 0-127 ASCII, and
(2) Providing most of the Europeans with "their" characters.

They certainly broke no previous standard. The previous practice
as explained above was not a standard. As I said, it was a hack, and
nobody ever cared about finding a solution before.

WNP> The way IBM implemented it, all case functions would have to be
WNP> table-driven, which is much less elegant than working with the
WNP> parallel ranges of characters in standard ASCII.

Why is the table-driven approach "much less elegant"? A syntax table
a la GNU's is easy to implement, costs little memory, simplifies and speeds
up the code, offers classifications beyond "letter/upper/lower" in a
simple way, and is easy to re-configure at run time. (I think, for an editor
a table-driven implementation is the only way to go. But let's keep to
the 8-bit issue. [Btw, the character classification problem is
one of Jon Bentley's primary examples for data-driven programming
in his "Programming Pearls".])

WNP> So all of you Europeans should lobby hardware manufacturers to
WNP> implement foreign characters in an intelligent way, and in a
WNP> STANDARD WAY across different architectures, and THEN you can
WNP> reasonably expect the authors of compilers and libraries and
WNP> tools to support these characters.
The Intelligent Way will be some ASCII extended to eight bits.

It may be hard for Americans to accept that ASCII is not
sufficient as an international code for information interchange
(and information processing). A change has to be made and that
may hurt one or the other, resulting in emotionally heated
reactions on the American side. At least ASCII has been proven
to be useful for some decades now, and large systems depend on
it, including UNIX(tm).

Authors of compilers, libraries, tools, and programming languages
have begun to at least consider foreign character sets. ANSI C
is the current popular example. While their trigraphs are just
a superfluous wart on a wart in my opinion, their concept of
"locales" is just the thing we all think of. Your <ctype.h> for
you, mine for me.

In the meantime, please re-read Torsten Olsson's article. All he
asks for is to respect all bits in a character variable as
pertaining to the character code itself. Don't mess around with
the eighth, ninth, or whatever bit of it, please.

We Europeans don't want to prohibit you Americans from writing
programs using the ASCII code any more, don't get me wrong. If
we want to extend your programs for our needs we will probably
manage that as we did until now. Don't break your head with
writing sorting routines for non-ASCII character sets.

Thank you.
Martin

[PS: I have to admit: I am frightened of the day I have to write
programs respecting Japanese/Chinese character sets, too. These
poor fellows have to suffer from ASCII most, I think.]
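As a hedged illustration of "your <ctype.h> for you, mine for me", the following C sketch uses the standard setlocale() and toupper() calls. The wrapper names upcase_byte and use_native_ctype are made up for this example. Which 8-bit codes actually count as letters depends entirely on the locale the system provides, so nothing here assumes a particular character set.

```c
#include <ctype.h>
#include <locale.h>

/* Hypothetical wrapper: upper-case one byte of text according to the
   LC_CTYPE locale currently in effect.
   The cast matters: where plain char is signed, an 8-bit character is a
   negative value, and passing that to toupper() is undefined behaviour. */
int upcase_byte(char ch)
{
    return toupper((unsigned char)ch);
}

/* Hypothetical helper: adopt the character classification of the user's
   environment. Returns nonzero on success, zero if the locale named by
   the environment is not available. */
int use_native_ctype(void)
{
    return setlocale(LC_CTYPE, "") != NULL;
}
```

In the default "C" locale only a-z are lower-case letters, so upcase_byte() leaves every 8-bit code unchanged. After a successful use_native_ctype() the same call follows the tables of the selected locale, with no change to the calling program.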
wnp@killer.UUCP (Wolf Paul) (06/29/88)
In article <920@infbs.UUCP> neitzel@infbs.UUCP (Martin Neitzel) writes: >(Here comes yet anotherone of those European bastards... :-) Nun, so schlimm sind wir doch nicht! >torsten@DNA.LTH.Se (Torsten Olsson) writes: >TO> US PC programmers still live in a 7-bit world! >TO> [...] >In article <126@dcs.UUCP> wnp@dcs.UUCP (Wolf N. Paul) replied: >WNP> [...] >WNP> The C functions toupper()/tolower() rely on upper and lower case to be >WNP> two parallel groups of consecutive codes within the ASCII scheme. >Basically Right. But one of the reasons for those macros/functions >is to get independent of the used character set, or am I completely >wrong? (I know, the UNIX(tm) <ctype.h> requires us to write something >like "if (isascii(c) && islower(c) ...", but that should be relatively >easy to port into an 8-bit environment. On the other hand, if someone >thinks: "Hey, 7 bits in my char for the ascii code, now let's see what >I can mess around with the 8th!" -- that's neither portable nor justified >by K&R. I fully agree with you there. My point is that if we are going to use the 8-bit character set, and if, like IBM, we call it "extended ASCII", let's extend it consistently. In standard ASCII, uppercase alphabetics are codes 65-90 (Hex 41-5A), and lowercase alphas are codes 97-122 (Hex 61-7A). Thus, by adding 32 (Hex 20) to an uppercase character, I can convert it to lower case, and by subtracting the same amount from a lower case character, i can convert it to an uppercase character. If IBM's character set were really "extended ASCII", this would work for the non-English, 8-bit characters, as well. It doesn't ... Of course, IBM is not alone to blame. When I worked with Apple // computers, they would sell a computer in Germany or Austria which could be switched between German and English, but if you moved that computer to France, or even to Switzerland, you'd be stuck with some common characters which your computer didn't produce. 
They used the eighth bit as a "flashing" attribute. > None of the following replies are intended to attack Wolf Paul > personally. But some points should be made clear. Don't feel attacked. I'm European myself. > >WNP> The way IBM has chosen to implement non-English characters on its >WNP> PC line is >WNP> (a) non-standard (i.e. applies only to IBM-compatible machines) and > >Then perhaps we can move to ISO Latin-X. That would still confront us >with the same problems: non-ASCII, 8-bit. (For more on the standards >issue, see below.) I'd be very interested if you or someone else could send me a table of ISO Latin-X. >WNP> (b) incompatible with the assumption about upper and lower case in ASCII >WNP> and thus in C and other programming languages. > >K&R did not made any assumptions about required character sets, except >there must be at least one positive element contained. So far regarding >"C & ASCII". But in practice most compiler writers assume ASCII, probably because computer manufacturers claim that their machines support ASCII -- albeit "extended". >WNP> (c) incompatible with the way European characters are implemented >WNP> on MOST printers and ANSI terminals. > >Perhaps I should explain what "this way" was: The ascii characters like >[]{}\~ were considered as "not so useful for Europeans" and their codes >were interpreted as national characters. That was a HACK, not a >solution! Yes, usage of those ascii charcaters and national characters >was mutually exclusive. Not "both of them" in one document, listen? I agree with that assessment. >[btw, that's one of the reasons why I write in English when programming, > and not in my native language (and the language the people that want to > read my programs here would like to read.)] One reason I avoid my native language (same as yours) when dealing with computers and programming are the unbelievable kludges which seem to have entered it. I can't get myself talking or writing about "Fietschers" of a system. 
(I picked that one our of "Chip" a couple of years ago). >Wolf is right: IBM did not constrain itself to any standard when >introducing their PCs. But they made a first step into a reasonable >direction: >(1) Keeping 0-127 ASCII, and >(2) Providing most of the Europeans with "their" characters. > >They certainly broke no previous standard. The previous practice >as explained above was not a standard. As I said, it was a hack, and >nobody ever before did care about finding a solution. But they missed the mark when they failed to group upper and lower case characters together in a manner similar to standard ASCII. >WNP> The way IBM implemented it, all case functions would have to be >WNO> table-driven, which is much less elegant than working with the >WNP> parallel ranges of characters in standard ASCII. > >Why is the table-driven approach "much less elegant"? A syntax-table >ala GNU's is easy to implement, costs few memory, simplifies and speeds >up the code, offers classifications beyond "letter/upper/lower" in a >simple way, is easy to re-configure at run time. (I think, for an editor >a table-driven implementation is the only way to go. But let's keep to >the 8-bit issue. [Btw, the character classification problem is >one of Jon Bentleys primary examples for data-driven-programming >in his "Programming Pearls".]) Because of the harebrained way IBM assigned characters to the eight-bit positions, you would actually need two tables -- one to go from upper to lower case, and another vice-versa. THAT's inelegant. >The Intelligent Way will be some ascii extended to eigth bits. > >It maybe hard to accept for Americans that ASCII is not >sufficient as an international code for information interchange >(and information processing). A change has to be made and that >may hurt one or the other, resulting in emotionally heated >reactions on the American side. 
At least ASCII has been proven >to be useful for some decades now, and large systems depend on >it, including UNIX(tm). > >Authors of compilers, libraries, tools, and programming languages >have begun to at least consider foreign character sets. ANSI C >is the current popular example. While their trigraphs are just >an superflous wart on a wart in my opinion, their concept of >"locales" is just the thing we all think of. Your <ctype.h> for >you, mine for me. I basically agree, but with todays's terminals and printers, you need a different ctype.h for your screen and for your printer. > >In the meantime, please re-read Torsten Olssons article. All he >asks for is to respect all bits in a character variable as >pertaining to the character code itself. Don't mess around with >the eigth, ninth, or whatever bit of it, please. With reference to C compilers, I don't think there's any funny business going on with the eighth bit. They just don't support the IBM non-English characters as alphabetics, because they don't fit into the ASCII scheme, even if one extended it to eight bits. >We Europeans don't want to prohibit you Americans to write >programs using the ASCII code anymore, don't get me wrong. If >we want to extend your programs for our needs we will probably >manage that as we did until now. Don't break your head with >writing sorting routines for non-ASCII character sets. Now that would truly have to be localized on a per-language basis, because different languages using the same character shapes don't necessarily sort them in the same sequence. So we won't bother with it over here -- I'll wait until I get back to Europe, and then see which part I've landed in before I mess with sorting :-). >[PS: I have to admit: I am frightened of the day I have to write >programs respecting Japanese/Chinese character sets, too. These >poor fellows have to suffer from ASCII most, I think.] Well, that won't fit into 8 bits, anyway. -- Wolf N. 
Paul * 3387 Sam Rayburn Run * Carrollton TX 75007 * (214) 306-9101 UUCP: killer!dcs!wnp ESL: 62832882 DOMAIN: wnp@dcs.UUCP TLX: 910-380-0585 EES PLANO UD
karl@haddock.ISC.COM (Karl Heuer) (06/30/88)
This discussion belongs in comp.std.internat; I'm moving it there. In article <4635@killer.UUCP> wnp@killer.UUCP (Wolf Paul) writes: >In standard ASCII, uppercase alphabetics are codes 65-90 (Hex 41-5A), >and lowercase alphas are codes 97-122 (Hex 61-7A). Thus, by adding 32 >(Hex 20) to an uppercase character, I can convert it to lower case, and >by subtracting the same amount from a lower case character, i can convert >it to an uppercase character. > >If IBM's character set were really "extended ASCII", this would work for >the non-English, 8-bit characters, as well. It doesn't ... This would probably be a good idea, simply on the grounds that (if there's no other reason to prefer one ordering over another) it would make certain operations faster. But in any case, portable programs won't take advantage of such coincidences, except through portable interfaces such as toupper(). What about characters like German eszet, which have no uppercase equivalent? This has the property islower(c), but toupper(c) != (c-0x20) no matter how you arrange the character set. (In ANSI C, toupper(c) returns the argument unchanged if it can't be uppercasified.) >[neitzel@infbs.UUCP (Martin Neitzel) writes:] >>[On most European printers and terminals] The ascii characters like >>[]{}\~ were considered as "not so useful for Europeans" and their codes >>were interpreted as national characters. That was a HACK, not a >>solution! Yes, usage of those ascii charcaters and national characters >>was mutually exclusive. Not "both of them" in one document, listen? > >I agree with that assessment. So do I. But it does have the advantage that (a) it allows the +0x20 hack to work, and (b) it yields a contiguous alphabet, so MIN_LETR <= c <= MAX_LETR works (which is probably more often assumed true than the +0x20 idiom). Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint
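A small C sketch of the point under discussion: the subtract-0x20 idiom set against a lookup of the lower/upper pairs that were listed earlier in the thread for the IBM PC set. The function names are invented for illustration, and the pair list contains only what Torsten's table gives, so it should not be read as a complete IBM PC case mapping.

```c
#include <stddef.h>

/* The unchecked idiom: only meaningful for ASCII a-z. */
int blind_upper(int c)
{
    return c - 0x20;
}

/* Hypothetical table lookup using the (lower, upper) pairs from the
   table posted at the start of the thread. Codes without a listed
   upper-case partner are returned unchanged, which matches what ANSI C
   specifies for toupper(). */
int pc_upper(int c)
{
    static const unsigned char pairs[][2] = {
        {129, 154}, {132, 142}, {134, 143}, {135, 128},
        {145, 146}, {148, 153}, {164, 165}
    };
    size_t i;

    if (c >= 'a' && c <= 'z')          /* assumes an ASCII-based set */
        return c - 0x20;
    for (i = 0; i < sizeof pairs / sizeof pairs[0]; i++)
        if (c == pairs[i][0])
            return pairs[i][1];
    return c;
}
```

For code 129 the blind idiom yields 97, which is ASCII 'a', while the lookup yields 154. Code 130 has no partner in the posted table, so pc_upper() hands it back untouched.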
wcs@skep2.ATT.COM (Bill.Stewart.<ho95c>) (06/30/88)
In article <4635@killer.UUCP> wnp@killer.UUCP (Wolf Paul) writes:
: >WNP> The way IBM implemented it, all case functions would have to be
: >WNP> table-driven, which is much less elegant than working with the
: >WNP> parallel ranges of characters in standard ASCII.
: >Why is the table-driven approach "much less elegant"? A syntax-table
:Because of the harebrained way IBM assigned characters to the eight-bit
:positions, you would actually need two tables -- one to go from upper
:to lower case, and another vice-versa. THAT's inelegant.
But you have to do the equivalent now, in the sense of
islower(c) ? toupper(c) : c vs.
isupper(c) ? tolower(c) : c
(System V uses a function call that does this)
When I'm doing this kind of work, I almost always use a table -
TOUPPER(c) or TOLOWER(c)
- it always does the right thing, even if the character
wasn't upper or lower case. If you *must* have
specific-case character results, it's easy to get.
- it only takes one step, so it's efficient
- it only references the character once, so macros are safe
- it works on BSD and System V
- it's extensible to non-ASCII character sets
- if you have shared libraries it doesn't waste much space.
--
# Thanks;
# Bill Stewart, AT&T Bell Labs 2G218, Holmdel NJ 1-201-949-0705 ihnp4!ho95c!wcs
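The table approach described above might look roughly like the following C sketch. The post does not show how TOUPPER and TOLOWER are defined, so the table names, the init routine and case_tabs_add are assumptions made for illustration. The sketch also assumes 8-bit chars and an ASCII-based set in which a-z are contiguous.

```c
#include <limits.h>

/* One entry per possible unsigned char value. */
static unsigned char upper_tab[UCHAR_MAX + 1];
static unsigned char lower_tab[UCHAR_MAX + 1];

/* Each macro mentions its argument exactly once, so an argument with
   side effects such as *p++ is safe. The cast keeps 8-bit characters
   from becoming negative indices where plain char is signed. */
#define TOUPPER(c) (upper_tab[(unsigned char)(c)])
#define TOLOWER(c) (lower_tab[(unsigned char)(c)])

/* Fill both tables: identity everywhere, then the ASCII letters. */
void case_tabs_init(void)
{
    int i;

    for (i = 0; i <= UCHAR_MAX; i++)
        upper_tab[i] = lower_tab[i] = (unsigned char)i;
    for (i = 'a'; i <= 'z'; i++) {
        upper_tab[i] = (unsigned char)(i - 'a' + 'A');
        lower_tab[i - 'a' + 'A'] = (unsigned char)i;
    }
}

/* Register one extra lower/upper pair, e.g. from a national set. */
void case_tabs_add(unsigned char lower, unsigned char upper)
{
    upper_tab[lower] = upper;
    lower_tab[upper] = lower;
}
```

After case_tabs_init(), a call such as case_tabs_add(132, 142) teaches both tables one of the IBM PC pairs mentioned earlier in the thread. Characters that are neither upper nor lower case map to themselves, so no islower()/isupper() test is needed before conversion.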
frisk@rhi.hi.is (Fridrik Skulason) (07/01/88)
In article <920@infbs.UUCP> neitzel@infbs.UUCP (Martin Neitzel) writes:
>
> On the other hand, if someone
>thinks: "Hey, 7 bits in my char for the ascii code, now let's see what
>I can mess around with the 8th!" -- that's neither portable nor justified
>by K&R.
>

Agreed! Another related problem is:

	"Well - nobody uses the 8th bit anyway, so ..... c &= 0x7f;"

I have even run into a couple of C compilers that assume this: MEGAMAX C
for the ATARI ST, and the C compiler for the Archimedes. This of course
means that I advise students here at the university NOT to buy the latter
machine.

Sometimes one cannot ignore the problem like this. For example, I had to
patch PC/NFS some days ago, since the terminal emulator stripped off the
8th bit. (And while doing it I added automatic translation from the
PC character set to/from ISO 8859/1.)

Not so serious maybe, but still boring, are programs that produce a
warning message if any character has the 8th bit set (PC-Kermit, for
example).

>
>WNP> (c) incompatible with the way European characters are implemented
>WNP> on MOST printers and ANSI terminals.
>
>Perhaps I should explain what "this way" was: The ascii characters like
>[]{}\~ were considered as "not so useful for Europeans" and their codes
>were interpreted as national characters.

Here in Iceland this used to be true until a few years ago. Then it was
decided to standardize on ISO 8859/1 (or "ECMA" code, as it was known then).

Now we are using ISO 8859/1 on our VAXes (instead of DEC multinational)
and on our HPs (instead of Roman8). Even the ATARI ST computer uses the
ISO standard (instead of a not-quite-IBM-PC-compatible character set with
Hebrew extensions).

So, now we can use [,],{,},|,\,@,^,~ and ` together with our national
characters.

The same goes for all other European countries - we are moving away from
modified 7 bit ASCII to standardized 8 bit character sets.

>
>Wolf is right: IBM did not constrain itself to any standard when
>introducing their PCs.
>But they made a first step into a reasonable
>direction:
>(1) Keeping 0-127 ASCII, and
>(2) Providing most of the Europeans with "their" characters.

Most of them, yes, but not all. Not, for example, all the Icelandic,
Danish, Norwegian or Portuguese characters. They corrected that on the
PS/2 series by providing the CP-850 character set, which includes all
the printable characters in ISO 8859/1 (Latin-1). Unfortunately, they
are not in the same positions as in the ISO standard.

>
>WNP> The way IBM implemented it, all case functions would have to be
>WNP> table-driven, which is much less elegant than working with the
>WNP> parallel ranges of characters in standard ASCII.
>

So let's just use a character set like ISO 8859/1, where the European
special characters are also (mostly) in parallel ranges.

One problem with parallel ranges is that some characters may only exist
as lower case, like the German eszett (position DF in ISO 8859/1).
(The character in position FF is y with two dots above, which is normally
not used in upper case.)

>
>WNP> So all of you Europeans should lobby hardware manufacturers to
>WNP> implement foreign characters in an intelligent way, and in a
>WNP> STANDARD WAY across different architectures, and THEN you can
>WNP> reasonably expect the authors of compilers and libraries and
>WNP> tools to support these characters.
>
>The Intelligent Way will be some ascii extended to eighth bits.
>

Not just "some ascii" - ISO 8859/1 (or /2 /3 /4).

>
>Authors of compilers, libraries, tools, and programming languages
>have begun to at least consider foreign character sets. ANSI C
>is the current popular example. While their trigraphs are just
>a superfluous wart on a wart in my opinion, their concept of
>"locales" is just the thing we all think of. Your <ctype.h> for
>you, mine for me.
The discussion about trigraphs in comp.lang.c some time ago looked somewhat silly to me, since some people were arguing that we (Europeans) needed those, where in fact we do not. After all, who cares if #define looks like <pound sign>define if it functions equally. -- Fridrik Skulason University of Iceland UUCP frisk@rhi.uucp BIX frisk This line intentionally left blank ...................
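The "(mostly) parallel ranges" of ISO 8859/1 can be sketched in C as below. The function names are made up for illustration. The sketch relies on the Latin-1 layout in which 0xC0-0xDE and 0xE0-0xFE are the upper and lower case ranges, with 0xD7 and 0xF7 being the multiplication and division signs rather than letters, and with 0xDF and 0xFF left alone because they have no upper case partner inside the set.

```c
/* Hypothetical Latin-1 upper-casing: the +/-0x20 offset works for the
   accented ranges as well, apart from the exceptions noted below. */
int latin1_toupper(int c)
{
    if (c >= 'a' && c <= 'z')
        return c - 0x20;
    if (c >= 0xE0 && c <= 0xFE && c != 0xF7)   /* 0xF7 is the division sign */
        return c - 0x20;
    return c;   /* includes 0xDF (eszett) and 0xFF (y with diaeresis) */
}

/* Hypothetical Latin-1 lower-casing, the mirror image of the above. */
int latin1_tolower(int c)
{
    if (c >= 'A' && c <= 'Z')
        return c + 0x20;
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7)   /* 0xD7 is the multiplication sign */
        return c + 0x20;
    return c;
}
```

By way of contrast, the "c &= 0x7f" habit quoted above turns 0xE9 (e with acute accent in Latin-1) into 0x69, a plain ASCII 'i', so the damage is silent rather than obvious.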
pinard@odyssee.UUCP (Francois Pinard) (07/01/88)
In article <699@omen.UUCP>, caf@omen.UUCP (Chuck Forsberg WA7KGX) writes: > In article <1988Jun22.223158.1366@LTH.Se> torsten@DNA.LTH.Se (Torsten Olsson) writes: > :US PC programmers still live in a 7-bit world! > :Yes, there i s intelligent life outside the USA. > The major US PSVANS (Telenet, Tymnet, CIS, etc.) all support 8 bits! We sometimes have the feeling that 8-bit support is mainly for binaries. (postnews requires me to give more new text, even if useless to do so...) -- Francois Pinard pinard@odyssee.uucp (514) 279-0716 `Vivement GNU!' pinard%odyssee@larry.mcrcim.mcgill.edu (514) 588-4656
wnp@dcs.UUCP (Wolf N. Paul) (07/02/88)
In article <1288@odyssee.UUCP> pinard@odyssee.UUCP (Francois Pinard) writes: >(postnews requires me >to give more new text, >even if useless to do >so...) I am posting this because Francois is not the only one who needs to be reminded of this. Actually, if you read the stuff in news.announce.newusers, you are not supposed to give more new text, but if your reply must be shorter than the quoted text (i.e. if you have to include more quoted text than the length of your own reply), you should "hide" the old text (i.e. fool postnews) by either changing the quote character from '>' to something else, or else, putting a space in front of each quote character as I have done above. The purpose of the "more new text than included quote" is to REDUCE VOLUME; to add garbage new text has exactly the opposite effect. Use common sense, folks! -- Wolf N. Paul * 3387 Sam Rayburn Run * Carrollton TX 75007 * (214) 306-9101 UUCP: killer!dcs!wnp ESL: 62832882 DOMAIN: wnp@dcs.UUCP TLX: 910-380-0585 EES PLANO UD
nwd@j.cc.purdue.edu (Daniel Lawrence) (07/02/88)
There has been a lot of talk about problems with 8 bit characters in
various programs and editors. I had many comments about this problem
in the last couple of years about MicroEMACS 3.9 (although it was 3.6
and 3.7 at the time), and I fixed it to handle 8 bit characters
transparently quite some time ago.

So, entering high byte characters from the keyboard using the
alternate key works properly. Also keys can be set up like the
following:

	store-macro-21
		insert-string &chr 130
	!endm
	bind-to-key execute-macro-20 FN^R

This causes an e with an acute accent to be inserted when the
alt-e combination is struck.

My problem now is this.... How do I determine how such a
character should be treated when converted to uppercase? Not knowing
the different languages involved, could someone in the know post such
info? Does it vary from language to language? How are characters like
this treated on UNIX machines of different sorts? Rather than bemoaning
US programmers' lack of attention, could someone tell us what is
the right way to handle things like this?

Daniel Lawrence (317) 742-5153
UUCP: {ihnp4!pur-ee!}j.cc.purdue.edu!nwd
ARPA: nwd@j.cc.purdue.edu
FIDO: 1:201/2 The Programmer's Room (317) 742-5533
madd@bu-cs.BU.EDU (Jim Frost) (07/03/88)
In article <126@dcs.UUCP> wnp@dcs.UUCP (Wolf N. Paul) writes:
|The C functions toupper()/tolower() rely on upper and lower case to be
|two parallel groups of consecutive codes within the ASCII scheme.

Keep in mind that the implementations of isupper() and islower() often
are table driven and do not make this assumption. Also, toupper() and
tolower() are sometimes as simple as:

	#define toupper(A) (A - ('a' - 'A'))

and other times are complex enough to check for upper/lowercaseness.

I'd just implement my own that handled the extra characters. Who says
that you have to use the routines supplied with your compiler?

jim frost
madd@bu-it.bu.edu
bcw@rti.UUCP (Bruce Wright) (07/04/88)
In article <4581@killer.UUCP>, bobc@killer.UUCP (Bob Calbridge) writes: > In article <1988Jun22.223158.1366@LTH.Se>, newsuser@LTH.Se (Lund Institute of Technology news server) writes: > > > Why? Well, we are sure the intelligent reader already grasps the > > reason. Take a look at the IBM PC character code set a b o v e > > ASCII 127. Our alphabet is there, too, and you just can't imagine > > what funny results your tools yield when encountering them. > > So, if your pet program is to become our pet, too, you have > > to rethink concerning using the 8th bit as a flag, you have > > to rewrite toupper, tolower, word scan, delete word, word > > counters and the like. > > :-) But then that's why, as you included in your missive it's > called ASCII. The A stands for American. Is there there > such as thing as ISCII??? > > I can just imagine the loops and whirls you would have to > go through to write a viable sort routine to include new > alpha sorts. And who's to determine what the order of sort > would be. I'll see ya at the international conference where > we can hash this out. :-] > I realize that Bob was just trying to be funny, but even if you live in the US you may want the international characters. If you are trying to correspond with people outside the English-speaking world (or sometimes even within the English-speaking world), it can be >>EXTREMELY<< useful to have the true characters rather than just their approximation in ASCII. This is especially true if you are using letters (you know, paper and ink and all that) rather than the network - your recipient will not appreciate misspellings which are "unavoidable" on your text editor. It's fairly easy to write a table to convert lower case to upper case for any given character set. Unfortunately the IBM set does not correspond to the ISO character set nor to the DEC multinational character set - this inhibits the portability of both files and programs. 
It would also be nice if the IBM-PC were set up to make >>GENERATING<<
the characters easier -- the DEC terminals and PC's are much nicer in
this respect. It is much easier to remember "Compose-e-'" than it is to
remember "Alt-1-3-0" (not to mention fewer keystrokes). Unfortunately
even on the DEC PC's most editors don't deal with the multinational set
well (though they do on some of the larger DEC systems).

You do have a real problem with sorts though - even though any GIVEN
collating sequence is easy (just a table lookup), you will find that
there is no UNIVERSAL collating sequence for all the "special"
characters. For example, Spanish considers "Ll" a special character
sequence and collates it separately from the English sequence "Ll". I
suspect that you would have to have a number of translation tables. But
that would be a relatively minor problem - at least for our purposes,
we are more interested in generating text for printing than we are in
sorting anything.

Bruce C. Wright
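The "Ll" case shows why a plain per-character table is not quite enough: the comparison has to look at two characters at once. Below is a minimal sketch, assuming ASCII input, case-insensitive comparison, and a rule that "ll" sorts as one unit right after every plain "l". The name collate_es() and the weight scheme are illustrative choices, not taken from any standard.

```c
#include <ctype.h>

/* Weight of the collation element at *s; advances *s past it.
 * Letters get even weights so that a digraph can slot in just
 * after its first letter. Non-letters keep their byte value and
 * therefore sort before all letters. */
static int next_weight(const char **s)
{
    const unsigned char *p = (const unsigned char *)*s;
    int c = tolower(p[0]);
    int n = tolower(p[1]);      /* p[0] is non-zero here, so p[1] exists */

    if (c == 'l' && n == 'l') {
        *s += 2;
        return 512 + 2 * ('l' - 'a') + 1;   /* just after plain 'l' */
    }
    *s += 1;
    if (c >= 'a' && c <= 'z')
        return 512 + 2 * (c - 'a');
    return c;
}

/* Compare two strings under the sketch's Spanish-style rule.
 * Returns <0, 0 or >0 like strcmp(). */
int collate_es(const char *a, const char *b)
{
    while (*a && *b) {
        int wa = next_weight(&a);
        int wb = next_weight(&b);
        if (wa != wb)
            return wa < wb ? -1 : 1;
    }
    if (*a)
        return 1;       /* a is longer */
    if (*b)
        return -1;      /* b is longer */
    return 0;
}
```

Under this rule "llama" sorts after "luz", whereas strcmp() puts it before. Other languages would need their own entries, which is where the "number of translation tables" comes in.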
daveb@geac.UUCP (David Collier-Brown) (07/04/88)
In article <7350@j.cc.purdue.edu> nwd@j.cc.purdue.edu.UUCP (Daniel Lawrence) writes:
[...]
> This causes an e with an acute accent to be inserted when the
>alt-e combination is struck.
>
> My problem now is this.... How do I determine how such a
>character should be treated when converted to uppercase. Not knowing
>the different languages involved, could someone in the know post such
>info? Does it vary from language to language?

Yes, it does change. For example, French accented characters in France
have their accents dropped when converted to uppercase. French accented
characters in Canada retain their accents when converted to uppercase.
This poses problems for a unilingual Anglo...

>How are characters like
>this treated on UNIX machines of different sorts. Rather than bemoaning
>the USA's programmers lack of attention, could someone tell us what is
>the right way to handle things like this?

Do a literature search, once you become aware there is a problem.

[Programmers usually get shortchanged in university, by the way, and
often do not hear about literature searches until they hit grad school,
where the professors **freak** to discover that the student doesn't have
this basic skill. If you've been shortchanged, descend on your local
library's reference desk and have them point you at their holdings on
the subject --dave]

--dave (that's funny, there's no-one in the office today) c-b
--
David Collier-Brown.  {mnetor yunexus utgpu}!geac!daveb
Geac Computers Ltd.,  | "His Majesty made you a major
350 Steelcase Road,   |  because he believed you would
Markham, Ontario.     |  know when not to obey his orders"
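One way to picture the France/Canada difference described above is an upper-casing routine that takes the convention as a parameter. This is only a sketch: the policy names and upcase_with_policy() are hypothetical, the codes are from the IBM PC set, and the table holds just four letters. Falling back to the unaccented capital when the set has no accented one is an assumption of the sketch, not a rule from either country.

```c
#include <stddef.h>

enum upcase_policy {
    UPCASE_KEEP_ACCENTS,    /* convention described above for Canada */
    UPCASE_DROP_ACCENTS     /* convention described above for France */
};

struct accent_entry {
    unsigned char lower;    /* accented lower-case code */
    unsigned char upper;    /* accented capital, or 0 if the set has none */
    unsigned char base;     /* unaccented ASCII capital */
};

/* IBM PC character set codes; partial table for illustration. */
static const struct accent_entry accents[] = {
    {130, 144, 'E'},    /* e-acute; the set has E-acute at 144 */
    {135, 128, 'C'},    /* c-cedilla */
    {133,   0, 'A'},    /* a-grave: no accented capital in this set */
    {138,   0, 'E'},    /* e-grave: no accented capital in this set */
};
#define ACCENTS_N (sizeof accents / sizeof accents[0])

int upcase_with_policy(int c, enum upcase_policy policy)
{
    size_t i;

    if (c >= 'a' && c <= 'z')
        return c - ('a' - 'A');     /* ASCII layout assumed */
    for (i = 0; i < ACCENTS_N; i++) {
        if (accents[i].lower == c) {
            if (policy == UPCASE_KEEP_ACCENTS && accents[i].upper != 0)
                return accents[i].upper;
            return accents[i].base; /* dropped, or nothing to keep */
        }
    }
    return c;                       /* not a letter this table knows */
}
```

So code 130 becomes 144 under one policy and a plain 'E' under the other, which is exactly the kind of choice a program cannot make without knowing its user.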
billp@infmx.UUCP (Bill Potter) (07/05/88)
In all this discussion about international standards for 7-bit / 8-bit
characters I've missed any reference to the X-OPEN group's Native
Language Support specification. It seems to me that everything anyone
could want in terms of internationalisation is covered in this spec.,
and as it will now, hopefully, be implemented by the major European
computer manufacturers and thus become a standard in Europe, it is
only a matter of time before the American computer manufacturers have to
follow. (That is assuming they are still interested in selling to
Europe.)
tneff@dasys1.UUCP (Tom Neff) (07/06/88)
In article <1988Jun22.223158.1366@LTH.Se> torsten@DNA.LTH.Se (Torsten Olsson) bellyaches:
>US PC programmers still live in a 7-bit world!
>
>But we don't!

Not one but two meaningless generalizations. US PC programmers live in
a multitude of "worlds," 7-, 8-bit and otherwise.

>All right, you get a lot of credit for producing lovely programs
>like uEmacs, Picnix, Tcless, and the like.

This is what passes for PC programming in Europe? Unix ports and
clones? Take a look at BRIEF, PC-WRITE, MKS Toolkit and LIST.COM
sometime. Blame Unix for 7-bit parochialism if you want, not PC
programmers.

> [then lists some characters]
>So, if your pet program is to become our pet, too, you have
>to rethink concerning using the 8th bit as a flag, you have
>to rewrite toupper, tolower, word scan, delete word, word
>counters and the like.

On the other hand, if you don't give a damn whether your program makes
the grade in Sweden or not, you can stick with what you're doing, or
publish the source and let Swedish programmers keep Swedish users happy
while we do the same here. Just a thought. :-)

In the meantime, if you have a complaint about the way some specific
programs behave, write the authors. Thanks for the reminder about the
international character set, but I imagine most people (short of the
Unix clonesters, whom you seem to rely on unduly) either give it to you
for free (per the examples above) or have good reasons to abandon it in
favor of a performance tradeoff.
--
Tom Neff                  UUCP: ...!cmcl2!phri!dasys1!tneff
"None of your toys        CIS: 76556,2536   MCI: TNEFF
 will function..."        GEnie: TOMNEFF    BIX: t.neff (no kidding)
ray@micomvax.UUCP (Ray Dunn) (07/06/88)
In article <7350@j.cc.purdue.edu> nwd@j.cc.purdue.edu.UUCP (Daniel Lawrence) writes:
>.... Also keys can be set up like the following:
>
>store-macro-21
> insert-string &chr 130
>!endm
>bind-to-key execute-macro-20 FN^R
>
> This causes an e with an acute accent to be inserted when the
>alt-e combination is struck.
>
> My problem now is this.... How do I determine how such a
>character should be treated when converted to uppercase. Not knowing
>the different languages involved....

You're really answering the question yourself Daniel.

You must allow the *user* to be able to specify the uc/lc and lc/uc
relationships.

To be *fully* general, no assumptions about the reversibility of the
relationships should be made.

I won't suggest the syntax to you, 'cos although a user of Gosling
Emacs on UNIX, I prefer <another editor> on the PC, and I'm sure you can
come up with something fully user definable.
--
Ray Dunn.                    | UUCP: ..!{philabs, mnetor}!micomvax!ray
Philips Electronics Ltd.     | TEL : (514) 744-8200 Ext: 2347
600 Dr Frederik Philips Blvd | FAX : (514) 744-6455
St Laurent. Quebec. H4M 2S9  | TLX : 05-824090
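A sketch of what "fully user definable" with no reversibility assumption might look like inside an editor: two independent 256-entry tables, one for each direction, which the user's startup file could overwrite entry by entry. All names here (struct casemap, casemap_init, casemap_set_upper, casemap_set_lower) are hypothetical and are not uEmacs functions.

```c
/* Two independent maps: setting upper[x] = y says nothing about lower[y]. */
struct casemap {
    unsigned char upper[256];   /* lc -> uc */
    unsigned char lower[256];   /* uc -> lc */
};

/* Start from identity everywhere, then fill in plain ASCII letters. */
void casemap_init(struct casemap *m)
{
    int c;

    for (c = 0; c < 256; c++) {
        m->upper[c] = (unsigned char)c;
        m->lower[c] = (unsigned char)c;
    }
    for (c = 'a'; c <= 'z'; c++) {      /* ASCII layout assumed */
        m->upper[c] = (unsigned char)(c - ('a' - 'A'));
        m->lower[c - ('a' - 'A')] = (unsigned char)c;
    }
}

/* User says: converting 'from' to upper case yields 'to'. */
void casemap_set_upper(struct casemap *m, unsigned char from, unsigned char to)
{
    m->upper[from] = to;
}

/* User says: converting 'from' to lower case yields 'to'. */
void casemap_set_lower(struct casemap *m, unsigned char from, unsigned char to)
{
    m->lower[from] = to;
}
```

For instance, a user could call casemap_set_upper(&m, 130, 'E') so that e-acute upper-cases to a plain E, while 'E' still lower-cases to a plain 'e'. The round trip does not return to code 130, and the tables do not pretend that it does.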
ddb@ns.ns.com (David Dyer-Bennet) (07/08/88)
In article <133@dcs.UUCP>, wnp@dcs.UUCP (Wolf N. Paul) writes:
> The purpose of the "more new text than included quote" is to REDUCE VOLUME;
> to add garbage new text has exactly the opposite effect. Use common sense,
> folks!

Common sense would dictate not attempting to impose a stupid "new text
ratio" requirement through the news software. While it is possible to
"fool" that software, I have also chosen instead to make my protest
visible on the occasions when I've been saddled with it.
--
-- David Dyer-Bennet
ddb@viper.Lynx.MN.Org, ...{amdahl,hpda}!bungia!viper!ddb
Fidonet 1:282/341.0, (612) 721-8967 hst/2400/1200/300
cjh@hpausla.HP.COM (Clifford Heath) (07/18/88)
> characters I've missed any reference to the X-OPEN groups Native
....
> only a matter of time before the American computer manufacturers have to
> follow .

Americans (like anyone) can be pretty parochial, but don't forget that
X-OPEN designed its standard around **Hewlett-Packard's** implementation
of Native Language Support, which was a working system six years ago.

X-OPEN is a good standard as far as it goes, but when you try to build
a DBMS that uses it, you run into problems that are (mostly) not present
in HP's implementation, mainly due to the lack of a distinction between
data language and user language, and an inadequate definition of how
such multi-lingual systems can operate. Remember that when you sort
something, it only appears sorted as long as you look at it using the
same collating sequence. The Btree indices used in most databases become
corrupt if you update them using an incorrect collating sequence, so you
MUST distinguish between user language (for messages etc) and data
language.

In any case, for anyone wanting to sell software outside their own
country, X-OPEN should be compulsory reading.

Clifford Heath, Hewlett Packard Australian Software Operation.
(UUCP: hplabs!hpfcla!hpausla!cjh, ACSnet: cjh@hpausla.oz)
I didn't get paid to say this, so if you don't like it, don't tell HP.
Don't tell me, either.
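The point about sorted data only looking sorted under its own collating sequence can be shown with a toy index. The sketch below is not how HP's or X-OPEN's NLS works; it assumes a sorted array standing in for a Btree, and two made-up collations (byte order and case-insensitive order) standing in for two languages. All names are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef int (*collate_fn)(const char *, const char *);

/* "User language" stand-in: raw byte order. */
int collate_bytes(const char *a, const char *b)
{
    return strcmp(a, b);
}

/* "Data language" stand-in: case-insensitive order. */
int collate_nocase(const char *a, const char *b)
{
    for (;; a++, b++) {
        int ca = tolower((unsigned char)*a);
        int cb = tolower((unsigned char)*b);
        if (ca != cb)
            return ca < cb ? -1 : 1;
        if (ca == 0)
            return 0;
    }
}

/* A sorted key array that remembers which collation built it. */
struct key_index {
    const char *const *keys;
    size_t nkeys;
    collate_fn data_collation;
};

/* Keys sorted under collate_nocase, not under collate_bytes. */
const char *const demo_keys[] = { "apple", "Banana", "cherry", "Date", "elder" };

/* Binary search using the caller's comparator; returns position or -1. */
long index_find(const struct key_index *ix, const char *key, collate_fn cmp)
{
    size_t lo = 0, hi = ix->nkeys;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int c = cmp(key, ix->keys[mid]);
        if (c == 0)
            return (long)mid;
        if (c < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return -1;
}
```

Looking up "Date" with the index's own data_collation finds it, while the same lookup with collate_bytes walks the wrong way at "cherry" and reports the key as missing. An insert made under the wrong comparator would put a key where later lookups cannot find it, which is the corruption described above; storing the collation with the index and ignoring the session's language for data operations avoids that.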