alan@APPLE.COM (Alan Mimms) (02/22/89)
This is an impassioned plea for people NOT to strip "that annoying parity
bit" when dealing with characters translated from keyboard events. This
"parity" bit is really a valid part of the character! Many character
sets (INCLUDING ISO Latin 1) REQUIRE all 256 possible values to be
representable. For example, if a user wants an "O-umlaut" (or O-diaeresis),
he won't get one when he's talking to a client that strips the high order
bit -- he'll get a "V" instead. xterm is guilty of this. Hopefully,
this will change with the next release (x11r4?).

Please keep those bits -- they're NOT GARBAGE. Some servers can now
generate all 256 ISO Latin 1 keysyms.

Alan Mimms                      My opinions are generally
Communications Products Group   pretty worthless, but
Apple Computer                  they *are* my own...

...it's so simple that only a child can do it!  -- Tom Lehrer, "New Math"
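The "V" in the example above falls straight out of the arithmetic: in ISO Latin 1, O-diaeresis is 0xD6, and clearing the top bit leaves 0x56, which is ASCII "V". A minimal sketch in C follows; the function names are hypothetical and are not taken from xterm or any other client.

```c
/* Hypothetical helper: what a client does when it treats bit 7 as parity. */
unsigned char strip_high_bit(unsigned char c)
{
    return c & 0x7F;            /* 0xD6 (O-diaeresis) becomes 0x56 ('V') */
}

/* Hypothetical helper: an 8-bit-clean client passes the byte through. */
unsigned char keep_all_bits(unsigned char c)
{
    return c;
}
```

Characters below 0x80 are unaffected by either function, which is why the stripping goes unnoticed by anyone who only types ASCII.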
john@acorn.co.uk (John Bowler) (02/24/89)
In article <8902211720.AA14715@internal.apple.com>, alan@APPLE.COM (Alan Mimms) writes:
> This is an impassioned plea for people NOT to strip "that annoying parity
> bit" when dealing with characters translated from keyboard events. This
> "parity" bit is really a valid part of the character! Many character
> sets (INCLUDING ISO Latin 1) REQUIRE all 256 possible values to be
> representable. For example, if a user wants an "O-umlaut" (or O-diaeresis),
> he won't get one when he's talking to a client that strips the high order
> bit -- he'll get a "V" instead. xterm is guilty of this. Hopefully,
> this will change with the next release (x11r4?).
>

But xterm neither receives nor transmits ISO-Latin-1 characters - it
receives keycodes, it transmits the codes defined by the VT102 transmitted
codes (well documented in a VT102 manual... if you can find one).
The latter are basically ASCII (7 bit) codes with some variations for national
character sets.

The VT220 can transmit 8 bit codes (corresponding to the complete
DEC multinational character set) but I don't believe the VT102 supports
these (I don't have a VT102 manual :-(.

If xterm receives a keycode whose current keysym mapping does not fall into
the DEC multinational character set I don't see what it can do about it.

Even if it transmits an 8 bit character (say from the upper half of
the multinational character set), in UN*X the tty will normally clobber
the eighth bit anyway.

As an experiment I switched my xterm pty into raw mode and typed the
<pound sterling> key on my keyboard (this generates a keycode with a
suggested keysym of XK_sterling). I regretted it - it would seem that
some 8 bit character did, in fact, get through, because my csh promptly
died (csh uses the ``spare'' bit in input characters while parsing the
line; if it receives a byte with the top bit set it screws up :-(.

This area is a total mess - but it's not X's fault - and a real solution
would be a major change to most of the computer world's preconceived
ideas. After all, what use is an extra bit if you want to transmit
Chinese or Japanese characters?

So how can you say what is the ``valid part'' of an arbitrary character
stream? Surely that is a matter for the two programs at either end, or
for international standards (and the only really accepted standard -
ASCII - says that there are only 7 bits in a character).

John Bowler (jbowler@acorn.co.uk)
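The csh failure described above is what one would expect from any parser that borrows bit 7 as a private flag. The sketch below is an illustration only, not csh source: the macro names QUOTE and TRIM and their values are an assumption about how such a parser might be written, and the helper functions are hypothetical.

```c
#define QUOTE 0200  /* assumed: flag bit a 7-bit-era parser borrows */
#define TRIM  0177  /* assumed: mask that recovers the 7-bit character */

/* Mark a character as quoted, e.g. after a backslash. */
unsigned char mark_quoted(unsigned char c)
{
    return c | QUOTE;
}

/* The parser's test for "was this character quoted?" */
int looks_quoted(unsigned char c)
{
    return (c & QUOTE) != 0;
}

/* Recover the character once parsing is done. */
unsigned char trim(unsigned char c)
{
    return c & TRIM;
}
```

With this scheme an input byte of 0xA3 (pound sterling in ISO Latin 1) is indistinguishable from a quoted 0x23, so it is treated as quoted and later trimmed to "#". A real shell may fail in less graceful ways.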
prc@maxim.ERBE.SE (Robert Claeson) (02/25/89)
In article <722@acorn.co.uk>, john@acorn.co.uk (John Bowler) writes:
> In article <8902211720.AA14715@internal.apple.com>, alan@APPLE.COM (Alan Mimms) writes:
> > This is an impassioned plea for people NOT to strip "that annoying parity
> > bit" when dealing with characters translated from keyboard events. This
> > "parity" bit is really a valid part of the character! Many character
> > sets (INCLUDING ISO Latin 1) REQUIRE all 256 possible values to be
> > representable. For example, if a user wants an "O-umlaut" (or O-diaeresis),
> > he won't get one when he's talking to a client that strips the high order
> > bit -- he'll get a "V" instead. xterm is guilty of this. Hopefully,
> > this will change with the next release (x11r4?).
>
> But xterm neither receives nor transmits ISO-Latin-1 characters - it
> receives keycodes, it transmits the codes defined by the VT102 transmitted
> codes (well documented in a VT102 manual... if you can find one).
> The latter are basically ASCII (7 bit) codes with some variations for national
> character sets.

Oh well. Why not make provisions for mapping the received keycodes into
the different national 7-bit character sets (there's an ISO standard for
them) instead of just assuming that everything is ASCII and chopping off
the eighth bit under the assumption that it is garbage or zero?

> The VT220 can transmit 8 bit codes (corresponding to the complete
> DEC multinational character set) but I don't believe the VT102 supports
> these (I don't have a VT102 manual :-(.

True, but DEC MCS isn't ISO 8859/1. The two are much the same, but not
identical, so you would need to map between 8859/1 and DEC MCS if you
wrote a VT220 emulator.

> If xterm receives a keycode whose current keysym mapping does not fall into
> the DEC multinational character set I don't see what it can do about it.

Ignore it. Don't chop off the eighth bit to get some random ASCII character.

> Even if it transmits an 8 bit character (say from the upper half of
> the multinational character set), in UN*X the tty will normally clobber
> the eighth bit anyway.

If by UNIX you mean the one that comes from the company that holds the
UNIX trademark, it won't. If your vendor sets ISTRIP by default, just
unset it. If, however, by UNIX you mean the one called 4.xBSD or
2.xBSD, you're right. They screwed it up. They assumed that everything
useful is 7 bits. Maybe it was, but that's not true anymore. As far as I
know, they plan to fix this in 4.4BSD.

However, if your terminal generates a parity bit and your communication
is set to 7 data bits, then the tty driver should chop it. But then, your
terminal can't generate 8-bit characters anyway. But this can never
happen on a workstation running X, since there's no such communication
involved.

> This area is a total mess - but it's not X's fault - and a real
> solution would be a major change to most of the computer world's
> preconceived ideas.

Computers and programs written in Europe and many other parts of the
world generally assume 8-bit data paths. So change "computer world" to
"Anglo computer industry".

> After all, what use is an extra bit if you want to transmit Chinese
> or Japanese characters?

Not much. So in fact, your programs should not assume 7 or 8 bit
characters. They should use a character data type that's large enough to
hold a 16 bit (or maybe even 32 bit) character. If you think this is a
major waste of space for 7 or 8 bit character sets, make it a
user-defined (programmer-defined) data type that you can define to
anything you like (char, short, long). And never rely on it having a
particular size in your code. That way, it's fairly easy to adapt it to
character set sizes other than ASCII's.

> So how can you say what is the ``valid part'' of an arbitrary character
> stream? Surely that is a matter for the two programs at either end,

No.
Parity bits and such are part of the communication protocol, not the
data path. So in fact, the tty driver -- not your program -- should check
and strip parity bits. Your program should always be able to rely on
what comes from the tty driver being valid data. Of course, there's still
the question of whether one, two or even more bytes of data make up a
data item (a character). I don't know how to handle this. Maybe someone
else knows?

> or for international standards (and the only really accepted standard
> - ASCII - says that there are only 7 bits in a character).

What? My only really accepted standard - ISO - says that there may be
7 or 8 or whatever bits in a character.

I hope this didn't take too much space, but I think the subject is too
important to ignore with phrases like "almost all characters are 7 bits,
so why should I care?" and "why can't everyone use ASCII?".

--
Robert Claeson, ERBE DATA AB, P.O. Box 77, S-175 22 Jarfalla, Sweden
Tel: +46 (0)758-202 50  Fax: +46 (0)758-197 20
EUnet:   rclaeson@ERBE.SE               uucp:   {uunet,enea}!erbe.se!rclaeson
ARPAnet: rclaeson%ERBE.SE@uunet.UU.NET  BITNET: rclaeson@ERBE.SE
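The programmer-defined character type suggested in this post might look like the sketch below. The name text_char and the two helper functions are hypothetical; the point is only that the width is chosen in one place and the rest of the code never assumes it.

```c
#include <stddef.h>

/* Hypothetical project-wide character type. Changing this one line to
   unsigned char or unsigned long changes the width everywhere. */
typedef unsigned short text_char;

/* Length of a zero-terminated text_char string, in characters. */
size_t text_length(const text_char *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Copy a zero-terminated string; dst must have room for it. */
void text_copy(text_char *dst, const text_char *src)
{
    while ((*dst++ = *src++) != 0)
        ;
}
```

Code written this way uses sizeof(text_char) wherever it needs a byte count, rather than assuming one byte per character. It does not answer the open question of how to decode a byte stream in which one character may span several bytes.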
guy@auspex.UUCP (Guy Harris) (02/26/89)
>Even if it transmits an 8 bit character (say from the upper half of the
>multinational character set), in UN*X the tty will normally clobber the
>eighth bit anyway.

Only if your UNIX's tty driver is old and crufty. More modern ones can be
told not to strip the 8th bit on either input or output, yet still run
in "cooked" or "cbreak" mode (i.e., you don't have to go into "raw" mode
to get an 8-bit data path).

>As an experiment I switched my xterm pty into raw mode and typed the
><pound sterling> key on my keyboard (this generates a keycode with a
>suggested keysym of XK_sterling).

(Said keysym being, as I remember, the ISO Latin #1 code for "pound
sterling".)

>I regretted it - it would seem that some 8 bit character did, in fact,
>get through, because my csh promptly died (csh uses the ``spare'' bit
>in input characters while parsing the line, if it receives a byte with
>the top bit set it screws up :-(.

Eventually, more modern C shells will handle 8-bit characters as well.
Some may already do so.

The bottom line is: *don't* use the inadequacies of some current UNIX
implementations as an excuse not to support 8-bit character sets in X11
terminal emulators; said inadequacies will not stick around forever.

>This area is a total mess - but it's not X's fault - and a real solution
>would be a major change to most of the computer world's preconceived
>ideas. After all, what use is an extra bit if you want to transmit
>Chinese or Japanese characters?

If you want to transmit them using "EUC" code sets, the extra bit is
*quite* useful. In the Japanese EUC set, bytes with the 8th bit not set
represent ASCII characters; bytes with the 8th bit set represent
characters from (what used to be called) the JIS 6226 and JIS 6220 sets,
or various "private" character sets. 6226 is a 14-bit character set, as
I remember, with two 7-bit bytes per character; the EUC version just
encodes those by turning the 8th bit on in both of those bytes.

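The EUC encoding of a two-byte JIS character described above can be sketched in a few lines of C. The function names are hypothetical, and the sketch covers only the two-byte set; the other code sets reachable from EUC are ignored.

```c
/* Encode one two-byte JIS character (each byte 7 bits wide) as EUC by
   turning on the 8th bit of both bytes. */
void jis_to_euc(unsigned char jis_hi, unsigned char jis_lo,
                unsigned char out[2])
{
    out[0] = jis_hi | 0x80;
    out[1] = jis_lo | 0x80;
}

/* In an EUC stream, a byte with the 8th bit clear is plain ASCII. */
int euc_is_ascii(unsigned char b)
{
    return (b & 0x80) == 0;
}
```

For example, the JIS code 0x24 0x22 (hiragana "a") becomes the EUC byte pair 0xA4 0xA2. A terminal emulator that strips the 8th bit would turn that pair back into the two ASCII characters "$" and a double quote.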
>So how can you say what is the ``valid part'' of an arbitrary character
>stream? Surely that is a matter for the two programs at either end, or
>for international standards (and the only really accepted standard -
>ASCII - says that there are only 7 bits in a character).

This will not be true forever; the ISO 8859 character sets are becoming
more widely accepted, for example.
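On systems with a POSIX-style termios interface, the 8-bit path in "cooked" mode mentioned earlier in this post can be requested roughly as follows. This is a sketch under that assumption; older tty drivers use different ioctls, and the function name is hypothetical.

```c
#include <termios.h>

/* Adjust a termios structure for an 8-bit data path without leaving
   canonical ("cooked") mode: stop stripping bit 7 on input, disable
   parity, and select 8 data bits. c_lflag (ICANON, ECHO, ...) is
   deliberately left alone. */
void make_eight_bit_clean(struct termios *t)
{
    t->c_iflag &= ~(tcflag_t)(ISTRIP | INPCK);
    t->c_cflag &= ~(tcflag_t)(CSIZE | PARENB);
    t->c_cflag |= CS8;
}
```

A caller would fetch the current settings with tcgetattr(fd, &t), pass them through this function, and install the result with tcsetattr(fd, TCSANOW, &t), checking both return values.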
john@acorn.co.uk (John Bowler) (02/28/89)
tmal@cl.cam.ac.uk (Mark Lomas) has pointed out to me that an ISO
extension to ASCII (he quotes ISO 2022-1973 - i.e. 15 years ago) extends
the definition to a full 8 bit character set, so my point about ASCII
only supporting 7 bits is incorrect. Indeed the ISO Latin alphabets are
now also widely accepted, as is the DEC multinational character set (not
much different from ISO-Latin1) and these are all compatible with ASCII.
As Mark says ``there is no excuse for new products which support only
the obsolete seven-bit standard.''

This raises the question of what VT102 really does support. Does anyone
know whether:-

1) It is capable of dealing with 8 bit received codes, as well as 7 bit.
2) There are mechanisms (like the compose character VT220 functionality)
   to output 8 bit characters (or 7 bit representations of these)?

More investigation of xterm shows that both Alan Mimms and myself were
half correct. In fact xterm does not strip the parity bit on output (as
Alan stated) but it does strip it on receipt of a character! Close
examination of the release 2 code shows the following code fragment in
in_put():-

        /* strip parity bit */
        for(i = bcnt, cp = bptr ; i > 0 ; i--)
                *cp++ &= CHAR;

so, you can type top-bit-set characters, and they get to the application,
but they do not reflect as top-bit-set... This must be a bug. The code
exists in release 3 too (in charproc.c, in_put()). In VTparse() (also in
charproc.c) the case for DECREQTPARM fills in a reply data structure with
comments ``no parity'' and ``eight bits''. I am mystified.

On a related matter I note that the value returned by xterm for the ANSI
currency symbol is 0224 - the ISO-Latin1 encoding value, whereas my
VT220 manual says that it should be 0230. Which is correct? (Note that I
still haven't got a VT102 manual :-).

John Bowler (jbowler@acorn.co.uk)
neideck@nestvx.dec.com (Burkhard Neidecker-Lutz) (03/01/89)
Just out of curiosity I eliminated the 3 lines in Tekproc.c and
charproc.c of the R3 xterm which read "strip the parity bit", and now
display of the upper half of the ISO8859 characters works like a charm
(this is ULTRIX 3.0, which has 8-bit-capable tty drivers).

The input side still screws up, as none of my German input keys
(function keys rebound with xmodmap) is accepted. If anybody knows a
simple fix, I'd be glad to hear about it.

Burkhard Neidecker-Lutz, Digital CEC Karlsruhe, Project NESTOR
neideck@nestvx.dec.com (Burkhard Neidecker-Lutz) (03/02/89)
To answer my own question about the required modification to allow xterm
to deal with 8 bit characters, the following kludge seems to work. I hope
I didn't miss anything important, but it works for me.

*** xterm/charproc.c Thu Jan 12 12:25:24 1989
--- xterm8/charproc.c Wed Mar 1 15:24:35 1989
*************** VTparse(
*** 381,387 ****
  {
          register TScreen *screen = &term->screen;
          register int *parsestate = groundtable;
!         register int c;
          register char *cp;
          register int row, col, top, bot, scstype;
          extern int bitset(), bitclr(), finput(), TrackMouse();
--- 381,387 ----
  {
          register TScreen *screen = &term->screen;
          register int *parsestate = groundtable;
!         register int c,ps;
          register char *cp;
          register int row, col, top, bot, scstype;
          extern int bitset(), bitclr(), finput(), TrackMouse();
*************** VTparse(
*** 388,395 ****
  
          if(setjmp(vtjmpbuf))
                  parsestate = groundtable;
!         for( ; ; )
!                 switch(parsestate[c = doinput()]) {
                   case CASE_PRINT:
                          /* printable characters */
                          top = bcnt > TEXT_BUF_SIZE ? TEXT_BUF_SIZE : bcnt;
--- 388,403 ----
  
          if(setjmp(vtjmpbuf))
                  parsestate = groundtable;
!         for( ; ; ) {
!                 c = 0xff & doinput();
!                 if (c < 0x80) {
!                         ps = parsestate[c];
!                 } else if (c >= 0xa0 && parsestate == groundtable) {
!                         ps = CASE_PRINT;
!                 } else {
!                         continue;
!                 }
!                 switch(ps) {
                   case CASE_PRINT:
                          /* printable characters */
                          top = bcnt > TEXT_BUF_SIZE ? TEXT_BUF_SIZE : bcnt;
*************** VTparse(
*** 930,935 ****
--- 938,944 ----
                          parsestate = groundtable;
                          break;
                  }
+         }
  }
  
  finput()
*************** in_put(
*** 1042,1050 ****
          } else if(bcnt == 0)
                  Panic("input: read returned zero\n", 0);
          else {
-                 /* strip parity bit */
-                 for(i = bcnt, cp = bptr ; i > 0 ; i--)
-                         *cp++ &= CHAR;
                  if(screen->scrollWidget && screen->scrollinput &&
                   screen->topline < 0)
                          /* Scroll to bottom */
--- 1051,1056 ----

Burkhard Neidecker-Lutz, Digital CEC Karlsruhe, Project NESTOR
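Read in isolation, the dispatch rule the patch adds to VTparse() is small enough to restate as a standalone function. The sketch below is a paraphrase for illustration, not xterm code: the enum, the function name and the in_ground_state parameter are hypothetical stand-ins for the patch's ps variable and its parsestate == groundtable test.

```c
/* What the parser loop should do with one input byte. */
enum action {
    ACT_TABLE,   /* 0x00-0x7f: look up in the current parse table */
    ACT_PRINT,   /* 0xa0-0xff in ground state: print as a character */
    ACT_IGNORE   /* anything else with the top bit set: drop it */
};

enum action classify_byte(int raw, int in_ground_state)
{
    int c = raw & 0xff;         /* undo sign extension of a char */

    if (c < 0x80)
        return ACT_TABLE;
    if (c >= 0xa0 && in_ground_state)
        return ACT_PRINT;
    return ACT_IGNORE;
}
```

Two consequences follow from the rule as written: bytes 0x80 to 0x9f are always discarded, and an 8-bit character arriving in the middle of an escape sequence is dropped rather than printed.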
prc@maxim.ERBE.SE (Robert Claeson) (03/04/89)
In article <727@acorn.co.uk>, john@acorn.co.uk (John Bowler) writes:
> This raises the question of what VT102 really does support, does anyone
> know whether:-
>
> 1) It is capable of dealing with 8 bit received codes, as well as 7 bit.

No. It can only handle various 7 bit character sets (ASCII is only one of
them).

> 2) Whether there are mechanisms (like the compose character VT220 functionality)
> to output 8 bit characters (or 7 bit representations of these)?

Nope.

> More investigation of xterm shows that both Alan Mimms and myself were
> half correct. In fact xterm does not strip the parity bit on output (as
> Alan stated) but it does strip it on receipt of a character!

(Short code example deleted)

> so, you can type top-bit-set characters, and they get to the application,
> but they do not reflect as top-bit-set... This must be a bug.

Yes, this is indeed a bug.

A VT102 terminal can't handle 8 bit character sets, but it sure can
handle many different 7-bit character sets. The old VT102 that's
collecting dust at work uses one of the ISO 646 (I think) codesets. The
point is, a VT102 emulator should be able to map the different 8859/1
codes into various 646 codes (user-selectable via a setup menu) and map
them back again on output.

In the Swedish code set, the character at the same position as the left
bracket in the ASCII character set is A diaeresis. The brackets and
braces aren't there. Thus, when operating in Swedish mode, braces should
be ignored and A diaeresis should be mapped into the right 7-bit
position on input and output.

But really, it doesn't even need to perform the mapping. Just checking
what characters are legal in the national character set and ignoring the
rest, and using 8859/1 internally, would be sufficient.

--
Robert Claeson, ERBE DATA AB, P.O. Box 77, S-175 22 Jarfalla, Sweden
Tel: +46 (0)758-202 50  Fax: +46 (0)758-197 20
EUnet:   rclaeson@ERBE.SE               uucp:   {uunet,enea}!erbe.se!rclaeson
ARPAnet: rclaeson%ERBE.SE@uunet.UU.NET  BITNET: rclaeson@ERBE.SE
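The mapping proposed above could be sketched as a pair of C functions. The function names are hypothetical, and the table is an assumption limited to the six letters that replace the brackets, backslash, braces and vertical bar in the Swedish 7-bit set; any other national positions in that set are not modelled.

```c
/* Map an ISO 8859/1 code to the Swedish 7-bit set.
   Returns the 7-bit code, or -1 if the character should be ignored. */
int latin1_to_swedish7(unsigned char c)
{
    switch (c) {
    case 0xC4: return 0x5B;     /* A diaeresis, at ASCII '[' */
    case 0xD6: return 0x5C;     /* O diaeresis, at ASCII '\' */
    case 0xC5: return 0x5D;     /* A ring,      at ASCII ']' */
    case 0xE4: return 0x7B;     /* a diaeresis, at ASCII '{' */
    case 0xF6: return 0x7C;     /* o diaeresis, at ASCII '|' */
    case 0xE5: return 0x7D;     /* a ring,      at ASCII '}' */
    case 0x5B: case 0x5C: case 0x5D:
    case 0x7B: case 0x7C: case 0x7D:
        return -1;              /* brackets and braces do not exist */
    }
    return c < 0x80 ? c : -1;   /* other 8-bit codes have no slot */
}

/* Map a Swedish 7-bit code back to ISO 8859/1; -1 if not 7-bit. */
int swedish7_to_latin1(unsigned char c)
{
    switch (c) {
    case 0x5B: return 0xC4;
    case 0x5C: return 0xD6;
    case 0x5D: return 0xC5;
    case 0x7B: return 0xE4;
    case 0x7C: return 0xF6;
    case 0x7D: return 0xE5;
    }
    return c < 0x80 ? c : -1;
}
```

An emulator that keeps 8859/1 internally would call the first function on keyboard input before transmitting and the second on received bytes before display. A setup menu would select which national table is in force.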