[comp.windows.x] 8 bits per char

alan@APPLE.COM (Alan Mimms) (02/22/89)

This is an impassioned plea for people NOT to strip "that annoying parity
bit" when dealing with characters translated from keyboard events.  This
"parity" bit is really a valid part of the character!  Many character
sets (INCLUDING ISO Latin 1) REQUIRE all 256 possible values to be
representable.  For example, if a user wants a "O-umlaut" (or O-diaeresis),
he won't get one when he's talking to a client that strips the high order
bit -- he'll get a "V" instead.  xterm is guilty of this.  Hopefully,
this will change with the next release (x11r4?).
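
The effect is easy to reproduce. The following is only a minimal sketch: the function names are made up for illustration and are not taken from xterm or any other client. It masks the ISO Latin 1 code for O-diaeresis (0xD6) down to 7 bits.

```c
#include <stdio.h>

/* Hypothetical helper: mimics a client that masks every input byte
 * down to 7 bits, treating the top bit as parity. */
unsigned char strip_high_bit(unsigned char c)
{
    return (unsigned char)(c & 0x7F);
}

/* Show what happens to ISO Latin 1 O-diaeresis (0xD6) when masked. */
void show_stripping(void)
{
    unsigned char o_diaeresis = 0xD6;
    unsigned char stripped = strip_high_bit(o_diaeresis);

    /* 0xD6 & 0x7F == 0x56, which is ASCII 'V'. */
    printf("0x%02X -> 0x%02X ('%c')\n", o_diaeresis, stripped, stripped);
}
```

Calling show_stripping() from main() prints the 0x56 ('V') result described above; the lowercase o-diaeresis (0xF6) turns into 'v' in the same way.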

Please keep those bits -- they're NOT GARBAGE.  Some servers can now
generate all 256 ISO Latin 1 keysyms.

Alan Mimms			My opinions are generally
Communications Products Group	pretty worthless, but
Apple Computer			they *are* my own...
...it's so simple that only a child can do it!  -- Tom Lehrer, "New Math"

john@acorn.co.uk (John Bowler) (02/24/89)

In article <8902211720.AA14715@internal.apple.com>, alan@APPLE.COM (Alan Mimms) writes:
> This is an impassioned plea for people NOT to strip "that annoying parity
> bit" when dealing with characters translated from keyboard events.  This
> "parity" bit is really a valid part of the character!  Many character
> sets (INCLUDING ISO Latin 1) REQUIRE all 256 possible values to be
> representable.  For example, if a user wants a "O-umlaut" (or O-diaeresis),
> he won't get one when he's talking to a client that strips the high order
> bit -- he'll get a "V" instead.  xterm is guilty of this.  Hopefully,
> this will change with the next release (x11r4?).
> 
But xterm neither receives nor transmits ISO-Latin-1 characters - it
receives keycodes, and it transmits the codes defined for the VT102
(well documented in a VT102 manual... if you can find one).
The latter are basically ASCII (7 bit) codes with some variations for national
character sets.  The VT220 can transmit 8 bit codes (corresponding to
the complete DEC multinational character set) but I don't believe the VT102 supports
these (I don't have a VT102 manual :-( ).

If xterm receives a keycode whose current keysym mapping does not fall into
the DEC multinational character set I don't see what it can do about it.  Even
if it transmits an 8 bit character (say from the upper half of the multinational
character set), in UN*X the tty will normally clobber the eighth bit anyway.  As an
experiment I switched my xterm pty into raw mode and typed the <pound sterling>
key on my keyboard (this generates a keycode with a suggested keysym of XK_sterling).
I regretted it - it would seem that some 8 bit character did, in fact, get through,
because my csh promptly died (csh uses the ``spare'' bit in input characters while
parsing the line; if it receives a byte with the top bit set it screws up :-( ).

This area is a total mess - but it's not X's fault - and a real solution would be
a major change to most of the computer world's preconceived ideas.  After all, what
use is an extra bit if you want to transmit Chinese or Japanese characters?

So how can you say what is the ``valid part'' of an arbitrary character stream? 
Surely that is a matter for the two programs at either end, or for international
standards (and the only really accepted standard - ASCII - says that there are
only 7 bits in a character).

John Bowler (jbowler@acorn.co.uk)

prc@maxim.ERBE.SE (Robert Claeson) (02/25/89)

In article <722@acorn.co.uk>, john@acorn.co.uk (John Bowler) writes:
> In article <8902211720.AA14715@internal.apple.com>, alan@APPLE.COM (Alan Mimms) writes:
> > This is an impassioned plea for people NOT to strip "that annoying parity
> > bit" when dealing with characters translated from keyboard events.  This
> > "parity" bit is really a valid part of the character!  Many character
> > sets (INCLUDING ISO Latin 1) REQUIRE all 256 possible values to be
> > representable.  For example, if a user wants a "O-umlaut" (or O-diaeresis),
> > he won't get one when he's talking to a client that strips the high order
> > bit -- he'll get a "V" instead.  xterm is guilty of this.  Hopefully,
> > this will change with the next release (x11r4?).

> But xterm neither receives not transmits ISO-Latin-1 characters - it
> receives keycodes, it transmits the codes defined by the VT102 transmitted
> codes (well documented in a VT102 manual... if you can find one).
> The latter are basically ASCII (7 bit) codes with some variations for national
> character sets.

Oh well. Why not make provisions for mapping the received keycodes into
the different national 7-bit character sets (there's an ISO standard for
them) instead of just assuming that everything is ASCII and chopping
off the eighth bit under the assumption that it is garbage or zero?

> The VT220 can transmit 8 bit codes (corresponding to the complete
> DEC multinational character set) but I don't believe the VT102 supports
> these (I don't have a VT102 manual :-(.

True, but DEC MCS isn't ISO 8859/1. It is much the same, but not
completely, so you would need to map between 8859/1 and DECMCS
if you wrote a VT220 emulator.

> If xterm receives a keycode whose current keysym mapping does not fall into
> the DEC multinational character set I don't see what it can do about it.

Ignore it. Don't chop off the eighth bit to get some random ASCII character.

> Even if it transmits an 8 bit character (say from the upper half of
> the multinational character set), in UN*X the tty will normally clobber
> the eighth bit anyway.

If by UNIX you mean the one that comes from the company that holds the
trademark for UNIX, it won't. If your vendor sets ISTRIP by default,
just unset it.
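
As a rough sketch of what "unsetting ISTRIP" looks like from a program: the code below assumes the POSIX termios interface (tcgetattr/tcsetattr) rather than the older termio ioctls, and the two function names are invented for this example. It leaves canonical ("cooked") processing alone and only changes the bits that affect the width of the data path.

```c
#include <termios.h>
#include <unistd.h>

/* Hypothetical helper: adjust a termios structure so the tty driver
 * passes all 8 bits.  Only the flags named here are touched, so the
 * line stays in whatever cooked/raw mode it was already in. */
void make_eight_bit_clean(struct termios *t)
{
    t->c_iflag &= ~(tcflag_t)ISTRIP;            /* do not strip the 8th bit on input */
    t->c_cflag &= ~(tcflag_t)(CSIZE | PARENB);  /* clear the size bits, no parity */
    t->c_cflag |= CS8;                          /* 8 data bits per character */
}

/* Hypothetical wrapper: apply the change to an open terminal.
 * Returns 0 on success and -1 on error, like tcsetattr(). */
int set_tty_eight_bit(int fd)
{
    struct termios t;

    if (tcgetattr(fd, &t) == -1)
        return -1;
    make_eight_bit_clean(&t);
    return tcsetattr(fd, TCSANOW, &t);
}
```

A program would call set_tty_eight_bit(STDIN_FILENO) once at startup. From the shell, "stty -istrip cs8" has much the same effect on systems whose stty supports those settings.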

If, however, by UNIX you mean the one that is called 4.xBSD or 2.xBSD,
you're right. They screwed it up. They assumed that everything useful
is 7 bits.  Maybe it was, but that's not true anymore. As far as I know,
they plan to fix this in 4.4BSD.

However, if your terminal generates a parity bit and your communication
is set to 7 data bits, then the tty driver should chop it. But then, your
terminal can't generate 8-bit characters anyway. But this can never happen
on a workstation running X, since there's no such communication involved.

> This area is a total mess - but its not X's fault - and a real
> solution would be a major change to most of the computer worlds
> preconceived ideas.

Computers and programs written in Europe and many other parts of
the world generally assume 8-bit data paths. So change "computer
world" to "Anglo computer industry".

> After all, what use is an extra bit if you want to transmit Chinese
> or Japanese characters?

Not much. So in fact, your programs should not assume 7 or 8 bit
characters. They should use a character data type that's large enough
to hold a 16 bit (or maybe even 32 bit) character. If you think this
is a major waste of space for 7 or 8 bit character sets, make it
a user-defined (programmer-defined) data type that you can define to
anything you like (char, short, long). And never rely on it having a
particular size in your code. That way, it's fairly easy to adapt it
to other character set sizes than ASCII.
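
A minimal sketch of that idea follows. The type name xchar and both functions are hypothetical, chosen only for illustration; the point is that the width is decided in one typedef and nothing else depends on it.

```c
#include <stddef.h>

/* Hypothetical name: the one place that decides how wide a character is.
 * Change this to unsigned char or unsigned long as the character set
 * requires; the code below does not care. */
typedef unsigned short xchar;

/* Length of a zero-terminated xchar string, counted in characters. */
size_t xchar_len(const xchar *s)
{
    size_t n = 0;

    while (s[n] != 0)
        n++;
    return n;
}

/* Copy at most dstsize-1 characters and always terminate the result.
 * Returns the number of characters copied, not counting the terminator. */
size_t xchar_copy(xchar *dst, const xchar *src, size_t dstsize)
{
    size_t i = 0;

    if (dstsize == 0)
        return 0;
    while (i + 1 < dstsize && src[i] != 0) {
        dst[i] = src[i];
        i++;
    }
    dst[i] = 0;
    return i;
}
```

Buffers would be sized as n * sizeof(xchar) rather than n bytes, so that changing the typedef is the only edit needed to move to a wider character set.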

> So how can you say what is the ``valid part'' of an arbitrary character
> stream? Surely that is a matter for the two programs at either end,

No. Parity bits and such are part of the communication protocol, not the
data path. So in fact, the tty driver -- not your program -- should
check and strip parity bits. Your program should always be able to rely
on what's coming from the tty driver being valid data. Of course, there's
still the question of whether one, two or even more bytes of data make up
a data item (a character). I don't know how to handle this. Maybe someone
else knows?

> or for international standards (and the only really accepted standard
> - ASCII - says that there are only 7 bits in a character).

What? My only really accepted standard - ISO -  says that there may be
7 or 8 or whatever bits in a character.

I hope this didn't take too much space, but I think the subject is
too important to ignore with phrases like "almost all characters
are 7 bits, so why should I care?" and "why can't everyone use ASCII?".

-- 
Robert Claeson, ERBE DATA AB, P.O. Box 77, S-175 22 Jarfalla, Sweden
Tel: +46 (0)758-202 50  Fax: +46 (0)758-197 20
EUnet:   rclaeson@ERBE.SE               uucp:   {uunet,enea}!erbe.se!rclaeson
ARPAnet: rclaeson%ERBE.SE@uunet.UU.NET  BITNET: rclaeson@ERBE.SE

guy@auspex.UUCP (Guy Harris) (02/26/89)

>Even if it transmits an 8 bit character (say from the upper half of the
>multinational character set), in UN*X the tty will normally clobber the
>eighth bit anyway.

Only if your UNIX's tty driver is old and crufty.  More modern ones can
be told not to strip the 8th bit on either input or output while still
running in "cooked" or "cbreak" mode (i.e., you don't have to go into
"raw" mode to get an 8-bit data path).

>As an experiment I switched my xterm pty into raw mode and typed the
><pound sterling> key on my keyboard (this generates a keycode with a
>suggested keysym of XK_sterling).

(Said keysym being, as I remember, the ISO Latin #1 code for "pound
sterling".)

>I regretted it - it would seem that some 8 bit character did, in fact,
>get through, because my csh promptly died (csh uses the ``spare'' bit
>in input characters while parsing the line, if it receives a byte with
>the top bit set it screws up :-(.

Eventually, more modern C shells will handle 8-bit characters as well. 
Some may already do so.

The bottom line is: *don't* use the inadequacies of some current UNIX
implementations as an excuse not to support 8-bit character sets in X11
terminal emulators; said inadequacies will not stick around forever.

>This area is a total mess - but its not X's fault - and a real solution
>would be a major change to most of the computer worlds preconceived
>ideas.  After all, what use is an extra bit if you want to transmit
>Chinese or Japanese characters?

If you want to transmit them using "EUC" code sets, the extra bit is
*quite* useful.  In the Japanese EUC set, bytes with the 8th bit not set
represent ASCII characters; bytes with the 8th bit set represent
characters from (what used to be called) the JIS 6226 and JIS 6220 sets,
or various "private" character sets.  6226 is a 14-bit character set, as
I remember, with two 7-bit bytes per character; the EUC version just
encodes those by turning the 8th bit on in both of those bytes.
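
A small sketch of that two-byte case, with invented function names. It deliberately ignores the other parts of EUC (the single-shift bytes that introduce the 6220 and "private" sets), so it should be read as an illustration of the bit manipulation and not as a complete EUC encoder.

```c
/* Hypothetical helper: encode one two-byte JIS 6226 character as EUC by
 * setting the 8th bit of both bytes.  Each input byte must be a 7-bit
 * graphic code (0x21..0x7E).  Returns 0 on success, -1 otherwise. */
int jis_to_euc(unsigned char j1, unsigned char j2, unsigned char out[2])
{
    if (j1 < 0x21 || j1 > 0x7E || j2 < 0x21 || j2 > 0x7E)
        return -1;
    out[0] = (unsigned char)(j1 | 0x80);
    out[1] = (unsigned char)(j2 | 0x80);
    return 0;
}

/* Hypothetical helper: in an EUC stream a byte with the 8th bit clear
 * is plain ASCII; a byte with it set belongs to another character set. */
int euc_byte_is_ascii(unsigned char b)
{
    return (b & 0x80) == 0;
}
```

For example, the JIS pair 0x30 0x21 becomes the EUC pair 0xB0 0xA1. A program that strips the 8th bit would turn those two bytes back into 0x30 0x21, i.e. the ASCII characters "0!", which is exactly the kind of damage being discussed.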

>So how can you say what is the ``valid part'' of an arbitrary character
>stream?  Surely that is a matter for the two programs at either end, or
>for international standards (and the only really accepted standard -
>ASCII - says that there are only 7 bits in a character).

This will not be true forever; the ISO 8859 character sets are becoming
more widely accepted, for example.

john@acorn.co.uk (John Bowler) (02/28/89)

tmal@cl.cam.ac.uk (Mark Lomas) has pointed out to me that an ISO extension
to ASCII (he quotes ISO 2022-1973 - ie 15 years ago) extends the definition
to a full 8 bit character set, so my point about ASCII only supporting
7 bits is incorrect.  Indeed the ISO Latin alphabets are now also widely
accepted, as is the DEC multinational character set (not much different
from ISO-Latin1) and these are all compatible with ASCII.  As Mark
says ``there is no excuse for new products which support only the obsolete
seven-bit standard.''

This raises the question of what the VT102 really does support.  Does
anyone know whether:-

1) It is capable of dealing with 8 bit received codes, as well as 7 bit?
2) There are mechanisms (like the compose character VT220 functionality)
 to output 8 bit characters (or 7 bit representations of these)?

More investigation of xterm shows that both Alan Mimms and myself were
half correct.  In fact xterm does not strip the parity bit on output (as
Alan stated) but it does strip it on receipt of a character!  Close
examination of the release 2 code shows the following code fragment in
in_put():-

			/* strip parity bit */
			for(i = bcnt, cp = bptr ; i > 0 ; i--)
				*cp++ &= CHAR;

so, you can type top-bit-set characters, and they get to the application,
but they do not reflect as top-bit-set...   This must be a bug.  The
code exists in release 3 too (in charproc.c, in_put()).  In VTparse()
(also in charproc.c) the case for DECREQTPARM fills in a reply data structure
with comments ``no parity'' and ``eight bits''.  I am mystified.

On a related matter I note that the value returned by xterm for the
ANSI currency symbol is 0224 - the ISO-Latin1 encoding value, whereas
my VT220 manual says that it should be 0230.  Which is correct?  (Note
that I still haven't got a VT102 manual :-).

John Bowler (jbowler@acorn.co.uk)

neideck@nestvx.dec.com (Burkhard Neidecker-Lutz) (03/01/89)

 
 
Just out of curiosity I eliminated the 3 lines in Tekproc.c and charproc.c
of the R3 xterm which are commented "strip parity bit", and now display of
the upper half of the ISO8859 characters works like a charm (this is ULTRIX
3.0 which has 8-bit-capable tty drivers). The input side still screws up,
as none of my German input keys (function keys rebound with xmodmap) are
accepted. If anybody knows a simple fix, I'd be glad to hear about it.
 
	Burkhard Neidecker-Lutz, Digital CEC Karlsruhe, Project NESTOR

neideck@nestvx.dec.com (Burkhard Neidecker-Lutz) (03/02/89)

 
To answer my own question about the required modification to allow xterm
to deal with 8 bit characters, the following kludge seems to work. I hope
I didn't miss anything important, but it works for me.
 
*** xterm/charproc.c	Thu Jan 12 12:25:24 1989
--- xterm8/charproc.c	Wed Mar  1 15:24:35 1989
*************** VTparse(
*** 381,387 ****
  {
  	register TScreen *screen = &term->screen;
  	register int *parsestate = groundtable;
! 	register int c;
  	register char *cp;
  	register int row, col, top, bot, scstype;
  	extern int bitset(), bitclr(), finput(), TrackMouse();
--- 381,387 ----
  {
  	register TScreen *screen = &term->screen;
  	register int *parsestate = groundtable;
! 	register int c,ps;
  	register char *cp;
  	register int row, col, top, bot, scstype;
  	extern int bitset(), bitclr(), finput(), TrackMouse();
*************** VTparse(
*** 388,395 ****
  
  	if(setjmp(vtjmpbuf))
  		parsestate = groundtable;
! 	for( ; ; )
! 		switch(parsestate[c = doinput()]) {
  		 case CASE_PRINT:
  			/* printable characters */
  			top = bcnt > TEXT_BUF_SIZE ? TEXT_BUF_SIZE : bcnt;
--- 388,403 ----
  
  	if(setjmp(vtjmpbuf))
  		parsestate = groundtable;
! 	for( ; ; ) {
! 		c = 0xff & doinput();
! 		if (c < 0x80) {
! 		  ps = parsestate[c];
! 		} else if (c >= 0xa0 && parsestate == groundtable) {
! 		  ps = CASE_PRINT;
! 		} else {
! 		  continue;
! 		}
! 		switch(ps) {
  		 case CASE_PRINT:
  			/* printable characters */
  			top = bcnt > TEXT_BUF_SIZE ? TEXT_BUF_SIZE : bcnt;
*************** VTparse(
*** 930,935 ****
--- 938,944 ----
  			parsestate = groundtable;
  			break;
  		}
+ 	}
  }
  
  finput()
*************** in_put(
*** 1042,1050 ****
  			} else if(bcnt == 0)
  				Panic("input: read returned zero\n", 0);
  			else {
- 				/* strip parity bit */
- 				for(i = bcnt, cp = bptr ; i > 0 ; i--)
- 					*cp++ &= CHAR;
  				if(screen->scrollWidget && screen->scrollinput &&
  				 screen->topline < 0)
  					/* Scroll to bottom */
--- 1051,1056 ----
 
	Burkhard Neidecker-Lutz, Digital CEC Karlsruhe, Project NESTOR
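
Read on its own, the dispatch logic that the patch adds to VTparse() amounts to the following classification. This is a stand-alone sketch for illustration only: the enum and function names are hypothetical and do not appear in xterm, and the in_ground_state flag stands in for the patch's "parsestate == groundtable" test.

```c
/* Hypothetical names for the three outcomes of the patched dispatch. */
enum byte_action { USE_PARSE_TABLE, PRINT_DIRECTLY, IGNORE_BYTE };

/* Classify one input value the way the patched loop does.
 * in_ground_state is nonzero when the parser is not in the middle of
 * an escape sequence. */
enum byte_action classify_byte(int raw, int in_ground_state)
{
    int c = raw & 0xff;             /* keep all 8 bits, drop anything above */

    if (c < 0x80)
        return USE_PARSE_TABLE;     /* 7-bit codes go through the existing tables */
    if (c >= 0xa0 && in_ground_state)
        return PRINT_DIRECTLY;      /* upper-half graphic characters are printed */
    return IGNORE_BYTE;             /* 0x80..0x9f, or an 8-bit byte mid-sequence */
}
```

So 0xD6 is printed when it arrives as ordinary text, is dropped if it arrives inside an escape sequence, and codes in the range 0x80 to 0x9f are always dropped.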

prc@maxim.ERBE.SE (Robert Claeson) (03/04/89)

In article <727@acorn.co.uk>, john@acorn.co.uk (John Bowler) writes:

> This raises the question of what VT102 really does support, does anyone
> know whether:-
> 
> 1) It is capable of dealing with 8 bit received codes, as well as 7 bit.

No. It can only handle various 7 bit character sets (ASCII is only one
of them).

> 2) Whether there are mechanisms (like the compose character VT220 functionality)
>  to output 8 bit characters (or 7 bit representations of these)?

Nope.

> More investigation of xterm shows that both Alan Mimms and myself were
> half correct.  In fact xterm does not strip the parity bit on output (as
> Alan stated) but it does strip it on receipt of a character!

(Short code example deleted)

> so, you can type top-bit-set characters, and they get to the application,
> but they do not reflect as top-bit-set...   This must be a bug.

Yes, this is indeed a bug. A VT102 terminal can't handle 8 bit character
sets, but it sure can handle many different 7-bit character sets. The
old VT102 that's collecting dust at work uses one of the ISO 646 (I think)
codesets.

The point is, a VT102 emulator should be able to map the different 8859/1
codes into various 646 codes (user-selectable via a setup menu) and map
them back again on output. In the Swedish code set, the character at
the same position as the left bracket in the ASCII character set is
A-diaeresis. The brackets and braces aren't there. Thus, when operating
in Swedish mode, braces should be ignored and A-diaeresis should be
mapped into the right 7-bit position on input and output.

But really, it doesn't even need to perform the mapping. Just checking
what characters are legal in the national character set and ignoring
the rest, and using 8859/1 internally, would be sufficient.
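
A sketch of such a mapping for Swedish mode is given below. It is an illustration under stated assumptions, not a complete table: the function names are invented, only the six letters that take the place of the brackets, backslash, braces and vertical bar are handled, and every other 7-bit position is assumed to be the same as in ASCII, which may not hold for the real national set.

```c
/* Hypothetical helper: map one ISO 8859/1 code to its position in the
 * Swedish 7-bit set.  Returns -1 for characters that have no place
 * there and should be ignored. */
int latin1_to_swedish7(unsigned char c)
{
    switch (c) {
    case 0xC4: return 0x5B;     /* A-diaeresis sits where '[' is in ASCII */
    case 0xD6: return 0x5C;     /* O-diaeresis sits where backslash is */
    case 0xC5: return 0x5D;     /* A-ring sits where ']' is */
    case 0xE4: return 0x7B;     /* a-diaeresis sits where '{' is */
    case 0xF6: return 0x7C;     /* o-diaeresis sits where '|' is */
    case 0xE5: return 0x7D;     /* a-ring sits where '}' is */
    case '[': case '\\': case ']':
    case '{': case '|': case '}':
        return -1;              /* not available in Swedish mode */
    default:
        /* Assumption: all remaining 7-bit positions match ASCII. */
        return c < 0x80 ? c : -1;
    }
}

/* Hypothetical helper: the reverse mapping, from a Swedish 7-bit code
 * to ISO 8859/1.  Returns -1 if the input is not a 7-bit code. */
int swedish7_to_latin1(unsigned char c)
{
    switch (c) {
    case 0x5B: return 0xC4;
    case 0x5C: return 0xD6;
    case 0x5D: return 0xC5;
    case 0x7B: return 0xE4;
    case 0x7C: return 0xF6;
    case 0x7D: return 0xE5;
    default:
        return c < 0x80 ? c : -1;
    }
}
```

With these two functions the emulator can keep 8859/1 internally: codes typed by the user go through latin1_to_swedish7() before being sent to the host, and bytes from the host go through swedish7_to_latin1() before being displayed.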
