[comp.text] International character set requirements needed

npn@cbnewsl.att.com (nils-peter.nelson) (12/18/90)

We have gotten several requests for support of international
character sets, and for "8 bit clean" troff. We need some
help figuring out what is needed.  Ignoring for the moment
Kanji as a separate case, this inquiry is directed to the
European market.
If we select a specific character, say the Swedish character
that looks like an a with a circle over it, as in the second
character of my ancestral home town, Bastad, then the following
implementations are possible:
1. Using the current DWB troff longname convention, you can
address any PostScript character, viz. "B\N'aring'stad"
(aring is the Adobe PostScript name for this character, which,
for the non-Swedes, is pronounced "oh").
2. We could invent a new troff escape, viz. "B\(aostad".
We don't want to invent one if there is an established convention.
3. There are apparently 8 bit internal representations that
are created by double striking keys (perhaps ESCAPE A, but I
don't know). I need to know if there is a practical standard
for such (i.e., one people actually use).
4. I recall some countries (e.g., Scandinavian) use ordinary 7 bit
punctuation chars in lieu of these letters, viz. "B{stad"
or somesuch.

I don't want to hear about every pending international standard.
I want to hear what people who will be using the software would
like to see.

If you have answers to the above, please send them to npn@mhuxo.att.com.
I am looking for data, not opinions, so replies from Europe
will be heavily weighted over others. I am not trying to
create another "gaol" vs. "jail" controversy.

yfcw14@castle.ed.ac.uk (K P Donnelly) (12/19/90)

I have come in on this discussion from the outside, but it sounds as if
you have misunderstood the requests.

It sounds to me as if what people are asking for is for troff to stop
stripping the eighth bit off characters in the input file, but instead
to pass them to the output file just like (7-bit) ASCII characters.

There is no need to invent new (7-bit) ASCII representations of non
ASCII characters, such as \ao for a-ring.  Such representations may 
be desirable for other purposes, but that is a separate issue.

Anyone with a VT220 or VT320 terminal can input a-ring using the 
three character sequence  
              <Compose-character> a *
The character gets hexadecimal code E5, in compliance with the ISO
standard for western European languages.  You see it on the screen and
edit it just like any other character using any editor (such as the
version 3.10 of microEmacs) which doesn't strip the eighth bit.
However, it gets very frustrating if the software which gets between you
and the laser printer insists for no good reason on stripping the eighth
bit and turning the a-ring into an 'e' (ASCII code 65 hex).
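
[Editor's illustration, not part of the original post: a minimal Python
sketch of the effect described above. The function name strip_eighth_bit
is made up for this example; it only mimics what a 7-bit-only filter
does to each byte.]

```python
def strip_eighth_bit(data: bytes) -> bytes:
    """Clear the high bit of every byte, as a 7-bit-only filter would."""
    return bytes(b & 0x7F for b in data)

# "Bastad" with a-ring stored as byte 0xE5 (its ISO 8859-1 code).
original = b"B\xe5stad"
stripped = strip_eighth_bit(original)

print(original.hex(" "))          # 42 e5 73 74 61 64
print(stripped.hex(" "))          # 42 65 73 74 61 64
print(stripped.decode("ascii"))   # Bestad
```

Masking 0xE5 with 0x7F gives 0x65, which is why the a-ring comes out as
a plain 'e'.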

There is lots of such software around, especially on Unix.  I think it
is something to do with the eighth bit having been used for parity check
in the past, so the software thought it was safest to filter it out. 
Nowadays, with cleaner communications lines, the eighth bit is hardly
ever used for parity check - it wasn't a very good system anyway - and
software packages which in the past have stripped the eighth bit are one
by one changing their policies - witness Kermit 3.0, TeX 3.0, microEmacs
3.10.

The Scandinavians have up til now used "national versions" of ASCII in
which characters like { } ~ | are replaced by national characters like
a-ring.  This often makes their names look weird in mail signatures, and
must cause them a lot of trouble when programming in languages like C
and Pascal.  The Germans use in computing the alternative system of
placing an 'e' after the vowel instead of an umlaut sign above it.  The
French have such a variety of accented characters that in computing
(mail messages and so on) they usually give up and leave out the
accents.  Sometimes they use devices like putting an apostrophe before or
after the vowel to indicate an acute accent.  On the Gaelic language 
bulletin board in which I participate, we always use a slash, '/', after
the vowel to indicate an acute accent.  The Icelanders have many more 
non-ASCII characters in their language than other Scandinavians, 
including some unique ones such as 'eth' and 'thorn' where you can't 
just "leave out the accent", so they have for long been ahead of the 
world in using 8-bit text in their computing.

   Kevin Donnelly

rcd@ico.isc.com (Dick Dunn) (12/20/90)

yfcw14@castle.ed.ac.uk (K P Donnelly) writes:

> It sounds to me as if what people are asking for is for troff to stop
> stripping the eighth bit off characters in the input file, but instead
> to pass them to the output file just like (7-bit) ASCII characters.

It's not at all that simple.  Troff has to know about the characters--it
needs to be able to find them in its width tables and know whether the
characters have ascenders and/or descenders (for sb/st/ct number regs).

There's also an issue of whether troff should produce 8-bit codes on
its output--there are some good arguments that it should not.  The matter
of 7-bit data paths is rather more complicated (and clumsy) than the single
issue of a parity bit that Donnelly mentions.  There are some methods of
data interchange, such as most email systems, that are inherently 7-bit.
It would be nice if we could just banish them, but compatibility is an
albatross.

The issue of inventing alternate representations, such as \(ao for "a ring"
goes beyond the issue of simple 8-bit transparency.  There are many more
characters needed than can be represented in an 8-bit code set.  Certainly
one wants a conventional 8-bit set (such as Latin 1) for convenience, but
more characters are needed even for European usage.  It is useful to have a
canonical representation in terms of 7-bit codes even if it's not the most
commonly used.

> The Scandinavians have up til now used "national versions" of ASCII in
> which characters like { } ~ | are replaced by national characters like
> a-ring...

These are not ASCII.  They are national versions of ISO 646.  If you like,
you could think of ASCII as a "national version" of ISO 646 used in the
USA.  646 provides a few codes which are reserved for national characters;
ASCII provides a particular assignment to those codes.  The Scandinavian
conventions are simply different assignments.
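
[Editor's illustration, not part of the original post: a minimal Python
sketch of "different assignments to the same codes". The three rows are
copied from the "National ISO 646 character sets" table in the article
posted later in this thread; the function name to_troff_names and the
idea of rewriting reassigned positions as \(xx names are assumptions
made for this example, not how the .la macro is implemented.]

```python
# The 12 code positions open to national assignment, written as the
# characters US-ASCII puts there.
POSITIONS = "#$@[\\]^`{|}~"

# Two-character troff names per position, in code order, for three of the
# variants listed in the article's table.
VARIANTS = {
    "US": "sh Do at lB rs rB ha oq lC ba rC ti".split(),
    "DK": "sh Do at AE /O oA ha oq ae /o oa ti".split(),
    "DE": "sh Do sc :A :O :U ha oq :a :o :u ss".split(),
}

def to_troff_names(text: str, variant: str) -> str:
    """Rewrite positions that `variant` assigns differently from US-ASCII."""
    table = {}
    for ch, us_name, name in zip(POSITIONS, VARIANTS["US"], VARIANTS[variant]):
        if name != us_name:
            table[ord(ch)] = "\\(" + name
    return text.translate(table)

print(to_troff_names("S} skriver vi noget s|dt p} dansk", "DK"))
# S\(oa skriver vi noget s\(/odt p\(oa dansk
print(to_troff_names("Und f}r Deutschen k|nnen wir", "DE"))
# Und f\(:ur Deutschen k\(:onnen wir
```

The same byte, '}', names a-ring under the Danish assignment and
u-diaeresis under the German one, so the text alone does not say which
variant was meant.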

> ...The Germans use in computing the alternative system of
> placing an 'e' after the vowel instead of an umlaut sign above it...

This alternative representation far predates computer usage, although it is
certainly a convenient solution.  Note also that scharfes ess turns into
"ss".
-- 
Dick Dunn     rcd@ico.isc.com -or- ico!rcd       Boulder, CO   (303)449-2870
   ...Mr. Natural says, "Use the right tool for the job."

clewis@ecicrl.UUCP (Chris Lewis) (12/21/90)

In article <1990Dec20.012516.23623@ico.isc.com> rcd@ico.isc.com (Dick Dunn) writes:
>yfcw14@castle.ed.ac.uk (K P Donnelly) writes:

>> It sounds to me as if what people are asking for is for troff to stop
>> stripping the eighth bit off characters in the input file, but instead
>> to pass them to the output file just like (7-bit) ASCII characters.

>It's not at all that simple.  Troff has to know about the characters--it
>needs to be able to find them in its width tables and know whether the
>characters have ascenders and/or descenders (for sb/st/ct number regs).

Donnelly is right though, that's primarily what people *do* want.
(Note that 8-bit clean in ditroff doesn't imply that ditroff is passing
the 8-bit characters directly to the printer - on the contrary, it
only means that 8-bit characters can appear in the ditroff *input*
file, and 8-bit characters can be found (somehow) in the width tables.  There's
no particular reason that the ditroff format output file (troff(5)) actually
contains 8-bit characters, for this file isn't really intended to be read,
and extensions to the "c<char>" directive could be altered to permit
octal or some other 7-bit clean representation if necessary.)

[Refresher on ditroff guts:
    troff document -> ditroff -> ditroff intermediate -> filter -> printer
			^                                  ^
			|                                  |
			+-------------+--------------------+
				      |
				  width tables

The width tables contain the width and kerning information that ditroff
needs to know for character placement, and also contains the byte
that the filter emits for each character (though, the filter doesn't
have to use them).  The ditroff intermediate is a displayable file with
simple commands indicating character placement, font size, points etc.]

It shouldn't be all that much harder for ditroff to permit 8-bit characters
in the width tables.  After all, it does permit octal sequences in the
fourth field (the character the backend is to emit to generate the desired
glyph).  It would be nicest if the left most column (the character
ditroff is searching for) could be 8-bit, but octal would probably serve
in a pinch, permitting both would be even better (and would permit
transmission of these files over 7-bit paths/editing via 7-bit vi's etc.)

Psroff's analogous tables do permit this.

'Twould be especially nice now that the newest vi's are 8-bit clean.
And emacs is now as well.

>There's also an issue of whether troff should produce 8-bit codes on
>its output--there are some good arguments that it should not.  The matter
>of 7-bit data paths is rather more complicated (and clumsy) than the single
>issue of a parity bit that Donnelly mentions.  There are some methods of
>data interchange, such as most email systems, that are inherently 7-bit.
>It would be nice if we could just banish them, but compatibility is an
>albatross.

Since you're talking about troff generating 8-bit codes on its output,
I'm not sure that this is a real issue, because the intermediate ditroff
format output isn't really an interchange format.  Regardless, what people
want is the ability to jam 8-bit characters into the input of ditroff and
have it do sane things, not necessarily their representation in the
intermediate file.  As it is now, the width tables *do* permit the filter
to emit 8-bit characters to the printer - they *have* to.  On that note,
you might have trouble getting 8-bit characters to the printer, but that's
the OS's/system administrator's/printer designer's fault.

On the other hand, permitting ditroff to accept 8-bit characters on *input*
may get people into trouble when they try to mail something through a
7-bit path.  But it isn't all that difficult to solve, either by uuencoding
(or similar) or having a program that converts the 8-bit characters to the
\(xx convention (and vice-versa) (I'm in fact going to be implementing
something like this in Psroff).  People solve it all the time when shipping
PC binaries.
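
[Editor's illustration, not part of the original post: a minimal Python
sketch of the kind of converter described above. It is not Psroff code.
The function names are made up, the table is deliberately incomplete,
and the \(xx names are taken from the proposal posted later in this
thread, assuming ISO 8859-1 (Latin-1) input.]

```python
# A few Latin-1 byte values and the two-character names proposed for them.
LATIN1_TO_NAME = {
    0xC5: "oA",  # A ring
    0xC6: "AE",
    0xD8: "/O",  # O slash
    0xDF: "ss",  # sharp s
    0xE5: "oa",  # a ring
    0xE6: "ae",
    0xE7: ",c",  # c cedilla
    0xE9: "'e",  # e acute
    0xF8: "/o",  # o slash
    0xFC: ":u",  # u diaeresis
}
NAME_TO_LATIN1 = {name: code for code, name in LATIN1_TO_NAME.items()}

def to_seven_bit(data: bytes) -> bytes:
    """Replace known 8-bit bytes with \\(xx; other bytes pass through."""
    out = bytearray()
    for b in data:
        if b in LATIN1_TO_NAME:
            out += b"\\(" + LATIN1_TO_NAME[b].encode("ascii")
        else:
            out.append(b)
    return bytes(out)

def to_eight_bit(data: bytes) -> bytes:
    """Replace known \\(xx sequences with the corresponding 8-bit byte."""
    out = bytearray()
    i = 0
    while i < len(data):
        name = data[i + 2:i + 4].decode("ascii", "replace")
        if data[i:i + 2] == b"\\(" and name in NAME_TO_LATIN1:
            out.append(NAME_TO_LATIN1[name])
            i += 4
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

print(to_seven_bit(b"B\xe5stad").decode("ascii"))   # B\(oastad
```

A real tool would also have to leave escaped backslashes (\e, \\) alone
and decide what to do with 8-bit bytes that have no name; this sketch
does neither.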

Requiring all of Europe to have to type those silly 4-character sequences
when trying to edit documents in their own language when 8-bit is *easy*
isn't a very nice thing to do to faithful customers.

And it isn't *just* Europe.  It's Canada too.  (I'm a member of the
CSA/Treasury Board Canadian Posix Working Group).  Canada is also trying
to encourage Latin-1 because of bilingual (English/French) requirements
both in government and in the private sector.  7-bit ASCII is very nearly
*only* the USA (most other English speaking countries are either tending
towards Latin-1, or a different version of 646.  646 in all its national
variants only satisfies completely a minority of the Roman-alphabet countries).
Lest one think that Canada is a minor addition to Western Europe in this
context, one should remember that Canada is the US's biggest single trading
partner by a substantial margin (considerably larger than Japan).  The only
market bigger than Canada is the EEC taken as a unit (aka Western Europe,
aka the other Latin-1 countries).  Markets, markets!  Heck, if you're ever
even thinking about Kanji, you really should satisfy the considerably larger
group of customers that need Latin-1.

>The issue of inventing alternate representations, such as \(ao for "a ring"
>goes beyond the issue of simple 8-bit transparency.

It *isn't* transparency, on the contrary.  HOWEVER, having both the 8-bit
input transparency as well as alternate representations that need only
7-bit would be a definite plus, so that people with 646 compliant terminals
can do the same things that the newer Latin-1 ones can.

And it ain't all that hard to do.  Hell, if I can do it in psroff via
CAT troff without source, AT&T should be able to do it with ditroff.
-- 
Chris Lewis, Phone: (613) 832-0541
UUCP: uunet!utai!lsuc!ecicrl!clewis
Moderator of the Ferret Mailing List (ferret-request@eci386)
Psroff mailing list (psroff-request@eci386)

heimir@rhi.hi.is (Heimir Thor Sverrisson) (12/24/90)

In <1990Dec20.012516.23623@ico.isc.com> rcd@ico.isc.com (Dick Dunn) writes:

>			There are some methods of
>data interchange, such as most email systems, that are inherently 7-bit.
>It would be nice if we could just banish them, but compatibility is an
>albatross.

There is nothing that tells me that email systems should be _inherently 7-bit_.
In fact here in Iceland we have to hack almost every piece of communications
software to be able to use it in our _inherently 8-bit environment_.  This

I can see no reason at all for some stupid mailers to strip off the eighth bit
(including Interactive's version of sendmail).  They don't have to - and should
not - interpret the contents of the mail they are transmitting.  This is
quite different from the troff situation where a program has to know a lot
about its input set.  So why don't you guys simply open up your mailers
and be ready with an 8-bit clean version by the end of 1991!
--
Heimir Thor Sverrisson			heimir@rhi.hi.is

keld@login.dkuug.dk (Keld J|rn Simonsen) (12/26/90)

I was the coauthor of an article on 7-bit names for troff of
non-ASCII characters, together with some support for national
ISO 646 variants. Here is the article. Enjoy!

: This is a shar archive.  Extract with sh, not csh.
: This archive ends with exit, so do not worry about trailing junk.
: --------------------------- cut here --------------------------
PATH=/bin:/usr/bin:/usr/ucb
echo Extracting 'Makefile'
sed 's/^X//' > 'Makefile' << '+ END-OF-FILE ''Makefile'
Xextchar.dit:	extchar specchar
X		refer -e -p typesetting extchar | tbl | troff $(DEVICE) -ms >extchar.dit
X
Xspecchar:	spec nicetr
X		sh nicetr spec >specchar
X
Xprint:		extchar.dit
X		dip $(DEVICE) extchar.dit
X
Xallchar:	spec nicetr spec.p400
X		sh nicetr spec spec.p400 >allchar
X
Xdistr:
X		shar Makefile extchar spec nicetr tmac.la pchdefs typesetting >distr
+ END-OF-FILE Makefile
chmod 'u=r,g=r,o=r' 'Makefile'
set `wc -c 'Makefile'`
count=$1
case $count in
351)	:;;
*)	echo 'Bad character count in ''Makefile' >&2
		echo 'Count should be 351' >&2
esac
echo Extracting 'extchar'
sed 's/^X//' > 'extchar' << '+ END-OF-FILE ''extchar'
X.ds LF DRAFT
X.so pchdefs
X.so tmac.la
X.la US
X.TL
XAn extension to the troff character set
X.AU
XE.G. Keizer
XK.J. Simonsen
XJ. Akkerhuis
X.AI
XVrije Universiteit, Amsterdam, The Netherlands
XUniversity of Copenhagen, Copenhagen, Denmark
XC.M.U., USA
X.AB
XThe typesettting program
X.I troff
Xwas originally written for formatting English text for the CAT 48 Typesetter.
XIts offspring is used for formatting a variety of languages with a large
Xdiversity of output devices.
XThe authors agreed on an addition
Xto the troff character set covering old and new national and
Xinternational latin based character sets.
X.AE
X.NH
XThe problems
X.PP
XWhen adapting the
X.UX
Xtypesetting program
X.I troff\|
X.[
XOssanna
X.]
Xto a new output device, one wants to have access
Xto the extra characters offered by the device,
Xwithout sacrificing any characters already in use.
XDevice independent \f2troff\fP, also called
X.I titroff
Xor 
X.I ditroff\|
X.[
XTypesetter independent troff Kernighan
X.]
Xuses a flexible font definition mechanism that allows addition and deletion
Xof characters.
X.LP
XMany people, including the authors, have used this mechanism
X.[
XKahrs and Moore
X.]
Xto add characters.
XThis has led to a diversity of names, with the expected conflicts
Xof using the same name for different characters and different names for the same
Xcharacter in different implementations.
XThus
X.I troff
Xinput files are becoming less and less portable,
Xeven for the same output device on different installations.
XWe regret this development and, during a conference in Copenhagen,
Xwe decided to make an attempt at some standardization.
XB.W. Kernighan, author of ditroff, agreed to our proposal of acting as a
Xclearing house for our new names and he still has to give his blessing
Xto this article.
X.PP
XWe realize that it is impossible to name every printable character in the
Xworld.
XThe total amount of different characters is simply too huge.
XNaming all the hundred different turtles in a turtle font is both
Xfrustrating and futile.
XWe restricted ourselves to the following categories:
X.IP \(bu
Xcharacters belonging to the printed language of several Western-European
Xcountries: \(AE \(ss
X.IP \(bu
Xvariations of letters, especially with accents: \(:i \(^u \(:o
X.IP \(bu
Xoften used mathematical symbols: \(AN \(OR \(c*
X.NH 2
XNatural language support
X.PP
XThe
X.I troff
Xcharacter set is based on the US-ASCII standard.
XThis standard is well suited to English text,
Xbut causes problems when used in most European countries.
XUS-ASCII is a version of the ISO 646-1983 standard.
XThe ISO standard states the characters used in 7 bit ASCII
Xand contains 94 printable graphic symbols.
XISO-646 allows national versions for 12 of its character positions.
XThe European Computer Manufacturers' Association ECMA is registering
Xall different national versions of ISO 646-1983 and assigns
Xa different character to each.
XThis allows the creation of documents with multiple character sets.
XThe assigned character serves to identify each character set in such documents.
XThe table below shows several versions conforming to ISO 646.
X.TS
Xcenter,allbox;
Xc s s s s s s s s s s s s s s
Xl l l s s s s s s s s s s s s
Xl l l c c c c c c c c c c c c.
XNational ISO 646 character sets
XCountry	Standard	.la parameter
XISO	ISO 646 IRV	ISO	\(sh	\(Cs	\(at	\(lB	\(rs	\(rB	\(ha	\(oq	\(lC	\(ba	\(rC	\(rn
XUSA	X3.4-1968	US	\(sh	\(Do	\(at	\(lB	\(rs	\(rB	\(ha	\(oq	\(lC	\(ba	\(rC	\(ti
XGreat Britain	BS 4730	GB	\(Po	\(Do	\(at	\(lB	\(rs	\(rB	\(ha	\(oq	\(lC	\(ba	\(rC	\(rn
XJapan	JIS C 6229	JP	\(sh	\(Do	\(at	\(lB	\(Ye	\(rB	\(ha	\(oq	\(lC	\(ba	\(rC	\(rn
XChina	GB 1988-80	CN	\(sh	\(Ye	\(at	\(lB	\(rs	\(rB	\(ha	\(oq	\(lC	\(ba	\(rC	\(rn
XDenmark	DS 2089	DK	\(sh	\(Do	\(at	\(AE	\(/O	\(oA	\(ha	\(oq	\(ae	\(/o	\(oa	\(ti
XNorway	NS 4551-1	NO	\(sh	\(Do	\(at	\(AE	\(/O	\(oA	\(ha	\(oq	\(ae	\(/o	\(oa	\(rn
X.la NO2
X	NS 4551-2	NO2	#	$	@	[	\(/O	]	^	`	{	|	}	~
X.la FI
XFinland		FI	#	$	@	[	\e	]	^	`	{	|	}	~
XSweden	SEN 850200 B	SE	\(sh	\(Cs	\(at	\(:A	\(:O	\(oA	\(:U	\(oq	\(:a	\(:o	\(oa	\(rn
X	SEN 850200 C	SE2	\(sh	\(Cs	\*('E	\(:A	\(:O	\(oA	\(:U	\('e	\(:a	\(:o	\(oa	\(:u
XGermany	DIN 66 003	DE	\(sh	\(Do	\(sc	\(:A	\(:O	\(:U	\(ha	\(oq	\(:a	\(:o	\(:u	\(ss
XHungary	MSZ 7795/3	HU	\(sh	\(Cs	\*('A	\*('E	\(:O	\(:U	\(ha	\('a	\('e	\(:o	\(:u	\(a"
X.\" Jugoslavia	JUS I.B1.002	JS	\(sh	\(Do	\*(vZ	\*(vS	\*(-D	\*('C	\*(vC	\*(vz	\*(vs	\*(-d	\*(vc	\*('c
X.la FR
XFrance	NF Z 62-010	FR	\(Po	\(Do	\(`a	\(de	\(,c	\(sc	\(ha	\(my	\('e	\(`u	\(`e	\(ad
XItaly		IT	\(Po	\(Do	\(sc	\(de	\(,c	\('e	\(ha	\(`u	\(`a	\(`o	\(`e	\(`i
X.la ES
XSpain		ES	#	$	@	[	\e	]	^	`	{	|	}	~
X.la ES2
X		ES2	#	$	@	[	\e	]	^	`	{	|	}	~
XPortugal		PT	\(sh	\(Do	\(sc	\*(~A	\(,C	\*(~O	\(ha	\(oq	\(~a	\(,c	\(~o	\(de
X		PT2	\(sh	\(Do	\(aa	\*(~A	\(,C	\*(~O	\(ha	\(oq	\(~a	\(,c	\(~o	\(ti
X.TE
X.la US
X.vs
X.ps
X.PP
XTerminals in these countries are often
Xadapted to these national variations.
XCreating
X.I troff
Xinput for Danish texts on Danish terminals is a frustrating experience.
XOne has to type
X.B \e(AE
Xfor
X.B \(AE
Xin spite of the presence of a special key for \(AE.
XYou can by-pass this by using the
X.I troff
Xcommand \f3.tr\fP,
Xwhich allows the mapping of any character to any other.
XWe have employed the \f3.tr\fP command in a macro \f3.la\fP
Xwhich is designed to make it possible to shift between
Xall the ISO 646 input charater sets in the above table.
XThe macro takes a code for the country as parameter; the first
Xtwo letters being the ISO 3166 two-letter country code.
XThe \f3.la\fP macro can be used like this:
X
X.ft CW
X.nf
X    .la US
X    First we write something in "God's own" character set.
X    .la DK
X    S} skriver vi noget s|dt p} dansk: sodavandsis.
X    .la DE
X    Und f}r Deutschen k|nnen wir auch etwas schreiben!
X    .la US
X.fi
X
X.ft
Xgiving:
X.br
X.ft
XFirst we write something in "God's own" character set.
X.la DK
XS} skriver vi noget s|dt p} dansk: sodavandsis.
X.la DE
XUnd f}r Deutschen k|nnen wir auch etwas schreiben!
X.la US
X.ft
X
XThe \f3.la\fP
Xmacro can only be used when all the characters of the character sets in use
Xhave a unique code on the printing device. Also you cannot change input
Xcharacter set within a diversion in 
X.I troff,
Xyou need to use the special character names if you want to use foreign
Xcharacters within a sentence. An example of having a French name
Xin a Danish text:
X
X.ft CW
X.nf
XJeg s} Jer\e(^ome Fran\e(,cois komme til K|benhavn.
X.ft
Xgiving:
X.ft
X.la DK
XJeg s} Jer\(^ome Fran\(,cois komme til K|benhavn.
X.ft
X.fi
X.la US
X.PP
XTo be able to use all other national characters within a national
Xcharacter set,
Xwe decided to introduce names for all the different national
Xcharacters,
Xeven for the 'default' US names.
X.PP
XAlso we went through the new ISO standards for Latin alphabets
X(ISO 8859)
Xand assured that all special characters there would have a unique
Xname according to this proposal.
X.PP
XA last warning is about the 
X.I troff 
Xescape character \e and the national characters taking its place.
XHere you must write the national character followed by an 'e'
Xto get the desired result.
X
XThe \f3.la\fP macro has the follwing contents:
X.DS
X.ft CW
X.ps 6
X.nr Sw \w' '
X.ta 8u*\n(Swu 16u*\n(Swu 24u*\n(Swu 32u*\n(Swu 40u*\n(Swu 48u*\n(Swu 56u*\n(Swu 
X.eo
X.nf
X.cc &
X&so tmac.la
X&cc .
X.fi
X.ec
X.ps
X.DE
X.ft 1
X.NH 2
XProblems we did not pursue
X.PP
XWritten English is based on the latin alphabet,
Xit hardly uses accents and other variations of letters.
XOther languages make use of accents above (\(`e), below (\(,c) and
Xthrough (\(/o) the letters.
XWe do not address the problems of languages with more than one
Xaccent per letter and
X. \"(\o'v\(aa'\h'-\w'v'u'\v'0.1m',.\v'-0.1m'\h'0.2m'),
Xaccents connecting letters.
XIn our opinion these problems have to be solved by separate preprocessors.
XThis allows a much more friendly user interface.
XThese preprocessors could also solve the problem of hyphenation
Xand ligatures for these languages.
X.NH 2
XThe troff naming scheme
X.PP
X.I Troff
Xhas three ways of naming characters:
X.IP \(bu
Xone character ASCII names like
X.B A
Xfor A,
X.B B
Xfor B
Xand
X.B @
Xfor @.
X.IP \(bu
Xescaped one character names prefixed by a
X.B \e
Xlike
X.B \e\-
Xfor current font minus,
X.B \ee
Xfor backslash
Xand
X.B \e\'
Xfor acute accent.
XThere are only a few of these.
X.IP \(bu
Xtwo character names prefixed by the indicator
X.B \e(
Xlike
X.B \e(sc
Xfor \(sc,
X.B \e(*g
Xfor \(*g
Xand
X.B \e(14
Xfor \(14.
X.LP
XThe sets of one-character names and escaped one character names are fixed.
XOnly the set of two character names can be extended.
X.NH
XChoosing new names
X.PP
XWhile choosing names for new characters we were very much aware
Xof the fact that the restriction of two characters per name
Xdefies all attempts to choose a consistent and logical naming scheme.
XStill we used a few principles in choosing the new names.
XWhenever these principles conflicted we refrained from long
Xdiscussions but placed more value on a quick decision.
X.LP
XOur principles:
X.IP \(bu
Xmay not conflict with original troff manual
X.[
XOssanna
X.]
X.IP \(bu
Xtry to avoid the national characters: 
X\(sh \(Do \(at \(lB \(rs \(rB \(ha \(oq \(lC \(ba \(rC \(ti
X.IP \(bu
Xuse characters associated with the graphical description of the symbol.
XThus
X.B \e(oA
Xfor
X.B \(oA
Xinstead of \f3\e(AA\f1.
X.IP \(bu
XWhenever a character is a combination of an accent and a letter
Xuse as name a character representing the accent followed by the letter.
XFor example: \f3\e('e\f1 for \('e.
XThis is the way these combinations were made on old-fashioned
Xtypewriters:
Xfirst hit the dead key with the accent then the key with the letter.
XThe characters we used for the accents (and the like) are:
X.br
X.sp 0.6
X.TS
Xallbox,center;
Xl  c  c  c  c  c  c  c  c  c  c  c  c  c.
XASCII	\(aa	\(ga	:	,	\(ti	o	\(ha	.	"	u	v	-	/
XAccent	\(aa	\(ga	\(ad	\(ac	\(ti	\(de	\(ha	.	\(a"	\(ab	\(ah	\(hy	\(sl
XName	aa	ga	ad	ac	ti	de	ha	a.	a"	ab	ah	hy	sl
X.TE
X.IP \(bu
Xfractions are done in the natural way: \f3\e(\f2nm\f1
Xfor \s-3\v'-0.45m'n\v'0.45m'\s0/\s-3m\s0.
X.IP \(bu
XCurrency signs have two letter names.
XThe first letter is Capital, the second small.
X.IP \(bu
XThe names for accents start with an \f3a\fP, for example
X.B \e(ad
Xfor \(ad.
X.NH
XThe weird characters one has
X.LP
XSuppose you have a 
X.I flat,
Xwhich is unlikely on any other device, but
Xstill want to have your input send to your friend/editor which can run
Xit on his troff.
XThe first thing to do is not using the character as it is
Ximplemented on your troff (f.i. \e(ft), but define a
Xstring at the beginning of the text and use that.
XThe next thing to do is to give a description how it should look like.
XSo if we stick with our flat, you end up at the start of the your
Xinput file with:               
X.DS
X.ft CW
X .\e" We use a "flat" (\e(ft) a lot instead of a backslash, because it
X .\e" stands out nicely and, since a backslash can be interpreted in
X .\e" so many ways, I want to make clear that you see an escape when I
X .\e" talk about it....  
X .\e" So I define the string ft
X .ds ft \e(ft            
X .\e" A "flat" is what in music stands for lowering the current
X .\e" tone. If that not is clear, consider a | (pipe character) with
X .\e" an small circle attached to it on the left bottom side.
X .\e" A define in the style of
X .\e"     .ds \eo'|o'     
X .\e" will do if you don't have anything more than a lineprinter around.
X .\e" If you want to be really fancy, you might want to try
X .\e" something like:    
X .\e" .nr x \ew'o'/2u     
X .\e" .nr y .2m          
X .\e" .ds ft \ev'-\enyu'\ez|\eh'\enxu'\ev'\enyu'\eS'-9'o\eS'0'
X .\e" Fancy ain't it?    
X.DE
X.ft 1
XSo the rest of your article will look (at the input side) as:
X.DS
X.ft CW
X An escape (denoted by \e*(ft) in troff will introduce a two
X character name XX by \e*(ft( (so \e*(ft(XX) and a two character
X named QQ string interpolation will be triggered
X by \e*(ft*( (\e*(ft*(QQ).
X.DE
X.NH
XThe future
X.PP
XNew versions of
X.I troff
Xinclude a more forms of naming characters.
XBoth \f3\eC\'\f2arbitray\ long\ name\f3\'\f1 and
X\f3\eN\'\f2absolute\ Number\ of\ character\ in\ current\ font\f3\'\f1
Xare used.
XThe first allows character names of arbitrary length.
XThe latter allows unnnamed characters.
XThis does not solve the problems discussed in this article.
XWorse, it will even be harder to choose names upon which
Xa sizeable number of people will agree.
X.NH
XBibliography
X.LP
X.[
X$LIST$
X.]
X.bp
X.NH
XThe character set
X.LP
X.so specchar
+ END-OF-FILE extchar
chmod 'u=r,g=r,o=r' 'extchar'
set `wc -c 'extchar'`
count=$1
case $count in
12207)	:;;
*)	echo 'Bad character count in ''extchar' >&2
		echo 'Count should be 12207' >&2
esac
echo Extracting 'spec'
sed 's/^X//' > 'spec' << '+ END-OF-FILE ''spec'
XWe divided the new characters in four categories.
XEach character is mentioned only once.
XWhenever we doubted we tried to place a character in the category we thought was
Xmost suitable. Lastly we included the characters from Ossanna's 
Xtroff document for reference.
X.NH 2
XSymbols from ISO 646 standards
X.LP
X.TS
Xsh		sharp
XYe		Yen
XCs		Currency sign
XDo		Dollar
XPo		English pound
Xat		at sign
XlB		left square bracket
Xrs		backslash
XrB		right square bracket
Xha		hat or accent circumflex
XlC		left curly bracket
XrC		right curly bracket
Xba		bar (possibly broken)
Xti		tilde
Xa"		accent double quote
XAE		AE
X/O		O slash
XoA		A circle
Xae		ae
X/o		o slash
Xoa		a circle
X:A		A diaeresis
X:O		O diaeresis
X:U		U diaeresis
X'e		e acute accent
X:a		a diaeresis
X:o		o diaeresis
X:u		u diaeresis
Xss		German ringel S
X`a		a grave accent
X,c		c cedilla
X`e		e grave accent
X`u		u grave accent
X`o		o grave accent
X`i		i grave accent
Xr!		reverse !
X~N		N tilde
Xr?		reverse ?
X~n		n tilde
X~a		a tilde
X^a		a hat
X,C		C cedilla
X^e		e hat
X^i		i hat
X~o		o tilde
X^u		u hat
X'E	\*('E	E acute accent
X'A	\*('A	A acute accent
X'a		a acute accent
X'i		i acute accent
X'c	\*('c	c acute accent
X'C	\*('C	C acute accent
X~O	\*(~O	O tilde
X~A	\*(~A	A tilde
X-d		d bar
X-D		Capital Icelandic Eth (D) / D bar
X.TE
X.ne 10
X.NH 2
XSymbols from ISO 8859 standards
X.LP
X.TS
XSd		small Icelandic eth (d)
X/l		Polish l
X/L		Polish L
XTp		Small Icelandic Thorn
XTP		Capital Icelandic Thorn
Xbb		broken bar
XS1	\h'0.9n'\v'-1n'\s-3\&1\s0\v'1n'	superscript 1
XS2	\h'0.9n'\v'-1n'\s-3\&2\s0\v'1n'	superscript 2
XS3	\h'0.9n'\v'-1n'\s-3\&3\s0\v'1n'	superscript 3
X:e		e diaeresis
X`E		E grave accent
X'o		o acute accent
X^o		o hat
X'u		u acute accent
XOf		feminin ordinator indicator
XOm		masculin ordinator indicator
XFo		double french open quote
XFc		double french close quote
Xac		accent cedilla
Xad		accent diaeresis
Xps		english paragraph sign
X:i		i diaeresis
X12		one half
X14		one quart
X34		three quart
Xmd		centered dot
Xno		not
X.TE
X.ne 10
X.NH 2
XTypographical characters
X.LP
X.TS
Xab		accent breve
Xao		accent corona
Xah		accent ha\o'c\(ah'ek
Xa.		accent dot
Xho		hook
X.i		dotless i
X.j		dotless j
Xfo		french open quote
Xfc		french close quote
XIJ	IJ	ligature IJ
Xij		ligature ij
Xtm		Trade Mark
Xoq		open quote
Xoe		French oe
XOE		French OE
XOK		check mark
X.TE
X.ne 20
X.NH 2
XMathematical characters
X.LP
X.TS
X%0		per mille
X-h		h bar
Xsd		second sign
Xc+		circle plus
Xc*		circle times
X>~		approximately greater
X<~		approximately less
X<<		much less
X>>		much greater
X=~		approximately equal
XOR		logical or
XAN		logical and
Xfa		for all
Xte		there exists
X3d		therefore
Xpp		perpendicular to
X/_		angle
X!<		not less
X!>		not greater
Xnm		not a member
Xcn		contains
Xnc		does not contain
X~~		approximately
XAh		Aleph
Xne		not equivalent
X-+		minus plus
X.TE
X.NH 2
XSymbols from the Troff manual by Ossanna.
X.LP
X.TS
Xem		3/4 Em dash
Xhy		hyphen
Xbu		bullet
Xsq		square
Xru		rule
X14		1/4
X12		1/2
X34		3/4
Xfi		fi
Xfl		fl
XFi		ffi
Xff		ff
XFl		ffl
Xde		degree
Xdg		dagger
Xfm		foot mark
Xct		cent sign
Xrg		registered
Xco		copyright
Xpl		math plus
Xmi		math minus
Xeq		math equals
X**		math star
Xsc		section
Xaa		acute accent
Xga		grave accent
Xul		underline
Xsl		slash (matching backslash)
X*a		alpha
X*b		beta
X*g		gamma
X*d		delta
X*e		epsilon
X*z		zeta
X*y		eta
X*h		theta	
X*i		iota
X*k		kappa
X*l		lambda
X*m		mu
X*n		nu
X*c		xi
X*o		omricron
X*p		pi
X*r		rho
X*s		sigma
Xts		terminal sigma
X*t		tau
X*u		upsilon
X*f		phi
X*x		chi
X*q		psi
X*w		omega
X*A		Alpha
X*B		Beta
X*G		Gamma
X*D		Delta
X*E		Epsilon
X*Z		Zeta
X*Y		Eta
X*H		Theta	
X*I		Iota
X*K		Kappa
X*L		Lambda
X*M		Mu
X*N		Nu
X*C		Xi
X*O		Omricron
X*P		Pi
X*R		Rho
X*S		Sigma
X*T		Tau
X*U		Upsilon
X*F		Phi
X*X		Chi
X*Q		Psi
X*W		Omega
Xsr		square root
Xrn		root en extender
X>=		>=
X<=		<=
X==		identical equal
X~=		approx =
Xap		approximates
X!=		not equal
X->		right arrow
X<-		left arrow
Xua		up arrow
Xda		down arrow
Xmu		multiply
Xdi		divide
X+-		plus-minus
Xcu		cup (union)
Xca		cap (intersection)
Xsb		subset of
Xsp		superset of
Xib		improper subset
Xip		improper superset
Xif		infinity
Xpd		partial deriative
Xgr		gradient
Xnp		not
Xis		integral sign
Xpt		proportional to
Xes		empty set
Xmo		member of
Xbr		box vertical rule
Xdd		double dagger
Xrh		right hand
Xlh		left hand
Xbs		Bell System logo
Xor		or
Xci		circle
Xlt		left top of big curly bracket
Xlb		left bottom
Xrt		right top
Xrb		right bottom
Xlk		left center
Xrk		right center
Xbv		bold verticel
Xlf		left floor
Xrf		right floor
Xlc		left ceiling
Xrc		right ceiling
X.TE
+ END-OF-FILE spec
chmod 'u=r,g=r,o=r' 'spec'
set `wc -c 'spec'`
count=$1
case $count in
4339)	:;;
*)	echo 'Bad character count in ''spec' >&2
		echo 'Count should be 4339' >&2
esac
echo Extracting 'nicetr'
sed 's/^X//' > 'nicetr' << '+ END-OF-FILE ''nicetr'
Xawk 'BEGIN {
X	print ".tr '"'"'\\'"'"'`\\`"
X	print ".vs 12p"
X	print ".ps 10"
X	FS="	"
X}
X/^.TS/ {
X	print ".TS H"
X	print "lw(1c) lw(1c) lw(5c) lw(1c) lw(1c) lw(5c)."
X	print "\\f3Char\\fP\t\\f3Name\\fP\t\t\\f3Char\\fP\t\\f3Name\\fP"
X	print ""
X	print ".TH"
X	tabling=1
X	no_entries=0
X}
X/^.TE/ {
X	tabling=0
X	if ( no_entries%2==1 ) printf "\n"
X}
Xtabling==1 && $1!=".TS" {
X  no_entries +=1
X  if ( $2=="" ) {
X	printf "\\(%s\t\\e(%s\t%s", $1, $1, $3
X  } else {
X	printf "%s\t\\e(%s\t%s", $2, $1, $3
X  }
X  if ( no_entries%2==0 ) printf "\n"
X  else printf "\t"
X}
Xtabling==0
XEND {
Xprint ".tr '"''"'``"
Xprint ".ps 10"
X}
X' $* | tbl
+ END-OF-FILE nicetr
chmod 'u=r,g=r,o=r' 'nicetr'
set `wc -c 'nicetr'`
count=$1
case $count in
613)	:;;
*)	echo 'Bad character count in ''nicetr' >&2
		echo 'Count should be 613' >&2
esac
echo Extracting 'tmac.la'
sed 's/^X//' > 'tmac.la' << '+ END-OF-FILE ''tmac.la'
X.de la		
X.\" languages - keld@dkuug.dk & storm@dkuug.dk
X.\" Covers all ECMA registrered versions of ISO 646
X.\" Countries according to ISO 3166
X.\" Commented is registered ECMA char code and standard number
X.fl
X.ie \\n(.$=0 .ds )L \\*(=L
X.el .ds )L \\$1
X.ds =L \\*(LA
X.ds LA \\*()L
X.rm )L
X.if "\\*(LA"DK"  .tr #\(sh$\(Do@\(at[\(AE\\\\\(/O]\(oA^\(ha\`\(ga{\(ae|\(/o}\(oa~\(ti  \"   DS 2089
X.if "\\*(LA"US"  .tr #\(sh$\(Do@\(at[\(lB\\\\\(rs]\(rB^\(ha\`\(ga{\(lC|\(ba}\(rC~\(ti  \" B X3.4-1968
X.if "\\*(LA"ISO" .tr #\(sh$\(Cs@\(at[\(lB\\\\\(rs]\(rB^\(ha\`\(ga{\(lC|\(ba}\(rC~\(rn  \" @ IRV
X.if "\\*(LA"GB"  .tr #\(Po$\(Do@\(at[\(lB\\\\\(rs]\(rB^\(ha\`\(ga{\(lC|\(ba}\(rC~\(rn  \" A BS 4730
X.if "\\*(LA"DE"  .tr #\(sh$\(Do@\(sc[\(:A\\\\\(:O]\(:U^\(ha\`\(ga{\(:a|\(:o}\(:u~\(ss  \" K DIN 66 003
X.if "\\*(LA"FR"  .tr #\(Po$\(Do@\(`a[\(de\\\\\(,c]\(sc^\(ha\`\(mu{\('e|\(`u}\(`e~\(ad  \" f NF Z 62-010 (1982)
X.if "\\*(LA"CN"  .tr #\(sh$\(Ye@\(at[\(lB\\\\\(rs]\(rB^\(ha\`\(ga{\(lC|\(ba}\(rC~\(rn  \" T GB 1988-80
X.if "\\*(LA"JP"  .tr #\(sh$\(Do@\(at[\(lB\\\\\(Ye]\(rB^\(ha\`\(ga{\(lC|\(ba}\(rC~\(rn  \" n JIS C 6229-1984
X.if "\\*(LA"IT"  .tr #\(Po$\(Do@\(sc[\(de\\\\\(,c]\('e^\(ha\`\(`u{\(`a|\(`o}\(`e~\(`i  \" Y
X.if "\\*(LA"ES"  .tr #\(Po$\(Do@\(sc[\(r!\\\\\(~N]\(r?^\(ha\`\(ga{\(de|\(~n}\(,c~\(ti  \" Z
X.if "\\*(LA"ES2" .tr #\(sh$\(Do@\(bu[\(r!\\\\\(~N]\(,C^\(r?\`\(ga{\(aa|\(~n}\(,c~\(ad  \" h
X.if "\\*(LA"PT"  .tr #\(sh$\(Do@\(sc[\(~A\\\\\(,C]\(~O^\(ha\`\(ga{\(~a|\(,c}\(~o~\(de  \" L
X.if "\\*(LA"PT2" .tr #\(sh$\(Do@\(aa[\(~A\\\\\(,C]\(~O^\(ha\`\(ga{\(~a|\(,c}\(~o~\(ti  \" g
X.if "\\*(LA"HU"  .tr #\(sh$\(Cs@\('A[\('E\\\\\(:O]\(:U^\(ha\`\('a{\('e|\(:o}\(:u~\(a"  \" i MSZ 7795/3
X.if "\\*(LA"NO"  .tr #\(sh$\(Do@\(at[\(AE\\\\\(/O]\(oA^\(ha\`\(ga{\(ae|\(/o}\(oa~\(rn  \" ` NS 4551 - 1
X.if "\\*(LA"NO2" .tr #\(sc$\(Do@\(at[\(AE\\\\\(/O]\(oA^\(ha\`\(ga{\(ae|\(/o}\(oa~\(bv  \" a NS 4551 - 2
X.if "\\*(LA"SE"  .tr #\(sh$\(Cs@\(at[\(:A\\\\\(:O]\(oA^\(ha\`\(ga{\(:a|\(:o}\(oa~\(rn  \" G SEN 850200 B
X.if "\\*(LA"SE2" .tr #\(sh$\(Cs@\('E[\(:A\\\\\(:O]\(oA^\(:U\`\('e{\(:a|\(:o}\(oa~\(:u  \" H SEN 850200 C
X.if "\\*(LA"FI"  .tr #\(sh$\(Do@\(at[\(:A\\\\\(:O]\(oA^\(ha\`\(ga{\(:a|\(:o}\(oa~\(ti
X.if "\\*(LA"JS"  .tr #\(sh$\(Do@\(vZ[\(vS\\\\\(-D]\('C^\(vC\`\(vz{\(vs|\(-d}\(vc~\('c  \" z JUS I.B1.002
X..
+ END-OF-FILE tmac.la
chmod 'u=r,g=r,o=r' 'tmac.la'
set `wc -c 'tmac.la'`
count=$1
case $count in
2287)	:;;
*)	echo 'Bad character count in ''tmac.la' >&2
		echo 'Count should be 2287' >&2
esac
echo Extracting 'pchdefs'
sed 's/^X//' > 'pchdefs' << '+ END-OF-FILE ''pchdefs'
X.\" The following characters are not defined on the Agfa P400 printer
X.ds 'E E\h'-\w"E"u-\w"\(aa"u/2u'\v'-.19m'\(aa\v'.19m'
X.ds 'A A\h'-\w"A"u-\w"\(aa"u/2u'\v'-.19m'\(aa\v'.19m'
X.ds ~A A\h'-\w"A"u-\w"\(ti"u/2u'\v'-.19m'\(ti\v'.19m'
X.ds ~O O\h'-\w"O"u-\w"\(ti"u/2u'\v'-.19m'\(ti\v'.19m'
X.ds 'C C\h'-\w"C"u-\w"\(aa"u/2u'\v'-.19m'\(aa\v'.19m'
X.ds 'c c\h'-\w"c"u-\w"\(aa"u/2u'\v'-.19m'\(aa\v'.19m'
X.\" .ds vS S\h'-\w"S"u-\w"\(ah"u/2u'\v'-.19m'\(ah\v'.19m'
X.\" .ds vs s\h'-\w"s"u-\w"\(ah"u/2u'\v'-.19m'\(ah\v'.19m'
X.\" .ds vc c\h'-\w"c"u-\w"\(ah"u/2u'\v'-.19m'\(ah\v'.19m'
X.\" .ds vC C\h'-\w"C"u-\w"\(ah"u/2u'\v'-.19m'\(ah\v'.19m'
X.\" .ds vZ Z\h'-\w"Z"u-\w"\(ah"u/2u'\v'-.19m'\(ah\v'.19m'
X.\" .ds vz z\h'-\w"z"u-\w"\(ah"u/2u'\v'-.19m'\(ah\v'.19m'
X.\" .ds -D D\h'-\w"D"u-\w"\(hy"u/2u'\v'-.19m'\(hy\v'.19m'
X.\" .ds -d d\h'-\w"d"u-\w"\(hy"u/2u'\v'-.19m'\(hy\v'.19m'
+ END-OF-FILE pchdefs
chmod 'u=r,g=r,o=r' 'pchdefs'
set `wc -c 'pchdefs'`
count=$1
case $count in
858)	:;;
*)	echo 'Bad character count in ''pchdefs' >&2
		echo 'Count should be 858' >&2
esac
echo Extracting 'typesetting'
sed 's/^X//' > 'typesetting' << '+ END-OF-FILE ''typesetting'
X%T Adventures with Typesetter\-Independent TROFF
X%A Mark Kahrs
X%A Lee Moore
X%R TR 159
X%C Rochester, NY
X%D June, 1985
X%I University of Rochester
X
X%A J. F. Ossanna
X%T NROFF/TROFF User's Manual
X%D October 1976
X%R Comp. Sci. Tech. Rep. 54
X%I Bell Laboratories
X%C Murray Hill, NJ
X
X%A B.W. Kernighan
X%T A typesetter-independent TROFF
X%D March 1982
X%R Comp. Sci. Tech. Rep. 97
X%I Bell Laboratories
X%C Murray Hill, NJ
X
X%A J. Akkerhuis
X%T Unknown
X%J Proceedings of the European Unix\(tm System User Group Autumn Meeting
X%D 7-9 september 1981
+ END-OF-FILE typesetting
chmod 'u=rw,g=r,o=r' 'typesetting'
set `wc -c 'typesetting'`
count=$1
case $count in
533)	:;;
*)	echo 'Bad character count in ''typesetting' >&2
		echo 'Count should be 533' >&2
esac
exit 0

hansen@pegasus.att.com (Tony L. Hansen) (12/27/90)

< From: heimir@rhi.hi.is (Heimir Thor Sverrisson)

<< There are some methods of data interchange, such as most email systems,
<< that are inherently 7-bit.  It would be nice if we could just banish
<< them, but compatibility is an albatross.

< There is nothing that tells me that email systems should be _inherently
< 7-bit_.  In fact here in Iceland we have to hack almost every piece of
< communications software to be able to use it in our _inherently 8-bit
< environment_.  This
< 
< I can see no reason at all for some stupid mailers to strip off the
< eighth bit (including Interactive's version of sendmail).  They don't
< have to - and should not - interpret the contents of the mail they are
< transmitting.  This is quite different from the troff situation where a
< program has to know a lot about its input set.  So why don't you guys
< simply open up your mailers and be ready with an 8-bit clean version by
< the end of 1991!

1991? Why not now? The System V release 4.0 mail program is completely 8-bit
clean! (If you can find anyplace within SVr4 mail which isn't, I'll
personally guarantee that the next version of mail which comes from UNIX
System Laboratories will have a fix for the problem.)

By the way, the SMTP protocol doesn't permit 8-bit data. This limits mailers
which must send mail using that protocol.

					Tony Hansen
				att!pegasus!hansen, attmail!tony
				    hansen@pegasus.att.com

keld@login.dkuug.dk (Keld J|rn Simonsen) (12/28/90)

hansen@pegasus.att.com (Tony L. Hansen) writes:

>< From: heimir@rhi.hi.is (Heimir Thor Sverrisson)

>< I can see no reason at all for some stupid mailers to strip off the
>< eighth bit (including Interactive's version of sendmail).  They don't
>< have to - and should not - interpret the contents of the mail they are
>< transmitting.  This is quite different from the troff situation where a
>< program has to know a lot about its input set.  So why don't you guys
>< simply open up your mailers and be ready with an 8-bit clean version by
>< the end of 1991!

>1991? Why not now? The System V release 4.0 mail program is completely 8-bit
>clean! (If you can find anyplace within SVr4 mail which isn't, I'll
>personally guarantee that the next version of mail which comes from UNIX
>System Laboratories will have a fix for the problem.)

I have also done patches to sendmail 5.61 and 5.64 to do 8-bit mail.
The patches require that you also have IDA installed.
They handle quite a few different 8-bit character sets, such as
8859-1, 8859-2 and the rest of the 8859 family, as well as vendor
character sets like the IBM codepages. In all, about 60 character sets
are supported in the current release.

>By the way, the SMTP protocol doesn't permit 8-bit data. This limits mailers
>which must send mail using that protocol.

My package also has provisions for sending 8-bit mail thru 7-bit
SMTP in an "encoded ASCII" mode. 

My package is available in dkuug.dk:pub/sm.8+bit.pa sm5.64.8+bit.pa
and ch.shar by anon ftp. By mail you can get it by mailing

                 mail uunet!dkuug.dk!archive
                 Subject: files pub
                 Names: sm.8+bit.pa sm5.64.8+bit.pa ch.shar

It's about 100 kb. Enjoy!

Keld Simonsen

tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill) (12/28/90)

hansen@pegasus.att.com (Tony L. Hansen) writes:
> 
> By the way, the SMTP protocol doesn't permit 8-bit data. This limits
> mailers which must send mail using that protocol.

True.  But there is no technical reason (other than short-sightedness)
why SMTP has to strip off the 8th (high) bit.  There are in fact working
versions of sendmail that don't disturb the 8th bit.
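
To make the effect of stripping concrete, here is a minimal Python sketch. It assumes the text is encoded in ISO 8859-1; the function name is made up for illustration and is not taken from any mailer.

```python
# Sketch: what a 7-bit channel does to 8-bit text.
# Assumption: the input is ISO 8859-1 (Latin-1) encoded.

def strip_high_bit(data: bytes) -> bytes:
    """Clear the 8th (high) bit of every byte, as a 7-bit-only mailer would."""
    return bytes(b & 0x7F for b in data)

original = "Båstad".encode("latin-1")   # a-ring is the single byte 0xE5
stripped = strip_high_bit(original)

print(original)                    # the 8-bit form
print(stripped.decode("ascii"))    # 0xE5 & 0x7F == 0x65, the letter 'e'
```

The damage is silent: the result is still valid ASCII, so the receiver has no way to tell that a letter was changed.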

bruce@balilly.UUCP (Bruce Lilly) (12/28/90)

In article <1990Dec27.043500.27639@cbnewsk.att.com> hansen@pegasus.att.com (Tony L. Hansen) writes:
>< From: heimir@rhi.hi.is (Heimir Thor Sverrisson)
>
><< There are some methods of data interchange, such as most email systems,
><< that are inherently 7-bit.  It would be nice if we could just banish
><< them, but compatibility is an albatross.
>
><   So why don't you guys
>< simply open up your mailers and be ready with an 8-bit clean version by
>< the end of 1991!
>
>1991? Why not now? The System V release 4.0 mail program is completely 8-bit
>clean! (If you can find anyplace within SVr4 mail which isn't, I'll
>personally guarantee that the next version of mail which comes from UNIX
>System Laboratories will have a fix for the problem.)

Is AT&T SVR4 available for the AT&T 3B1 and/or 7300 UNIX(TM)PC?
   ^^^^                        ^^^^
No? You've got to be kidding.

Reality check: it is impossible for everybody everywhere to upgrade (if
that's the right term (it might be)) to version N of any software.  Even
if economic limitations are ignored, other considerations, such as
inertia, inadequate disk space, compatibility requirements with other
software and (ahem :-) lack of vendor support prevent many people from
upgrading.

Of course, if you can make a port of SVR4 available at a reasonable price
for the 3B1, I'll upgrade. (i.e. put up or shut up)

Don't take this too seriously -- I'm really a big fan of AT&T. I just get
somewhat aggravated when I'm told (by AT&T) that I can't purchase (at any
price) AT&T software to run on my AT&T hardware (like the time I asked if
WWB was available for the 3B1).

>					Tony Hansen
>				att!pegasus!hansen, attmail!tony
>				    hansen@pegasus.att.com


--
	Bruce Lilly		blilly!balilly!bruce@sonyd1.Broadcast.Sony.COM

hansen@pegasus.att.com (Tony L. Hansen) (12/31/90)

<< From: hansen@pegasus.att.com (Tony L. Hansen)
<< By the way, the SMTP protocol doesn't permit 8-bit data. This limits
<< mailers which must send mail using that protocol.

< From: tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill)
< True.  But there is no technical reason (other than short-sightedness)
< why SMTP has to strip off the 8th (high) bit.  There are in fact
< working versions of sendmail that don't disturb the 8th bit.

I agree completely: there is no reason to limit SMTP to 7 bits.
Unfortunately, the standard currently REQUIRES the stripping and doing
anything else is non-standard. I would definitely support changing the
standard to allow an arbitrary 8-bit byte stream. This would also require
eliminating the limitation of 1024-byte lines and anything else in the
standard which is not content transparent.

System V release 4 mail is completely content transparent. As long as the
transport media is capable of handling the mail, SVr4 mail will be able to
get it to you unchanged. Unfortunately, it can't do so over SMTP
connections.

Since this discussion is going somewhat away from the bounds of comp.text,
I've added comp.mail.misc to the Newsgroup list.

					Tony Hansen
				att!pegasus!hansen, attmail!tony
				    hansen@pegasus.att.com

barmar@think.com (Barry Margolin) (12/31/90)

In article <1990Dec31.004055.10335@cbnewsk.att.com> hansen@pegasus.att.com (Tony L. Hansen) writes:
>System V release 4 mail is completely content transparent. As long as the
>transport media is capable of handling the mail, SVr4 mail will be able to
>get it to you unchanged.

What does it do when sending textual mail to a system that doesn't use
ASCII encoding, e.g. an IBM mainframe, or to a system with a different
newline convention (e.g. CRLF rather than LF)?  SMTP places restrictions on
the characters that may appear in a message to support automated
translation during the transfer process.
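
A rough Python sketch of the kind of automated translation meant here. The assumptions are mine: local text uses bare LF line ends, the wire format is ASCII with CRLF, and the receiving mainframe uses IBM code page 037 (one EBCDIC variant among several). The function names are illustrative only.

```python
# Sketch: translation during transfer works because the character set is known.
# Assumptions: LF-only local text, CRLF on the wire, EBCDIC code page 037 remotely.

def to_wire(text: str) -> bytes:
    """Local text -> 7-bit ASCII with CRLF line ends."""
    return text.replace("\n", "\r\n").encode("ascii")

def wire_to_ebcdic(wire: bytes) -> bytes:
    """What a gateway to an EBCDIC host could do with known-ASCII input."""
    return wire.decode("ascii").encode("cp037")

wire = to_wire("Hello\n")
print(wire)
print(wire_to_ebcdic(wire).hex())

# An 8-bit byte of unknown character set cannot be translated this way:
try:
    wire_to_ebcdic(b"B\xe5stad\r\n")
    translatable = True
except UnicodeDecodeError:
    translatable = False
print(translatable)
```

The last case fails because the gateway cannot know which character 0xE5 stands for, not because the byte cannot be carried.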

--
Barry Margolin, Thinking Machines Corp.

barmar@think.com
{uunet,harvard}!think!barmar

keld@login.dkuug.dk (Keld J|rn Simonsen) (01/02/91)

hansen@pegasus.att.com (Tony L. Hansen) writes:

>< From: tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill)
>< True.  But there is no technical reason (other than short-sightedness)
>< why SMTP has to strip off the 8th (high) bit.  There are in fact
>< working versions of sendmail that don't disturb the 8th bit.

This introduces a problem with "embedded slashes" which are now
represented internally in Sendmail with the 8th bit set.
Has anybody got Sendmail patches to remedy this?

>I agree completely: there is no reason to limit SMTP to 7 bits.
>Unfortunately, the standard currently REQUIRES the stripping and doing
>anything else is non-standard. I would definitely support changing the
>standard to allow an arbitrary 8-bit byte stream. This would also require
>eliminating the limitation of 1024-byte lines and anything else in the
>standard which is not content transparent.

I am much in favour of extending the character set supported by SMTP.
But you should be careful. What is the meaning of an 8-bit character?

Well, it depends on the character set employed. Today we know that only 7-bit
ASCII is allowed. But with 8-bit mail, is this octal code 0162 coming
over the line a "small a with acute accent" (as in ISO 8859-1:1987), a
Cent sign (as in IBM CP 437) or a "capital A with circumflex" (as
in HP Roman8)? This might become a real problem given the current
shares on the UNIX market. Just displaying the 8-bit data to a user
may be very confusing. It may even do strange things to your terminal equipment
if an IBM codepage character set is employed, as some of its characters
occupy positions that ISO 8859 and other vendors' character sets
reserve for the upper set of control characters.
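
The ambiguity can be checked with a few lines of Python. This sketch uses byte 0o242 (0xA2) as my own choice of example, since 0162 is itself a 7-bit value, and it relies on the codec tables shipped with Python (`latin-1`, `cp437`, `hp_roman8`) rather than on the vendors' own documentation.

```python
# Sketch: one 8-bit value, three character sets, three different characters.
# The byte 0o242 is an assumed example, not the code quoted in the post.

byte = bytes([0o242])
codecs = ("latin-1", "cp437", "hp_roman8")

meanings = {name: byte.decode(name) for name in codecs}
for name, char in meanings.items():
    print(f"{name:10} 0o242 -> {char!r}")
```

Under these tables the byte is a cent sign in ISO 8859-1 and an o with acute in code page 437, so a receiver that guesses the wrong set displays the wrong character without any error.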

Should one then just say "Use ISO 8859"? Well, what ISO 8859?
There are several parts, latin 1, latin 2 (eastern Europe),
Greek, Cyrillic, Arabic, Hebrew (among others). The abovementioned
character 0162 has different meanings in these different character sets.
ISO 8859-1 would be the natural choice (and is also specified in a recent
RFC on encoding: header.) But is that fair? I think that is like inventing
a new ASCII, only capable of serving one region of the world 
sufficiently - this time having Western Europe (EEC) and all of
North and South America covered. We should do something that could
cover the whole world.

It is also quite hard to persuade your manufacturers to change their
implementation character set, and even worse for equipment you already
have bought and installed. Some of this may even be running software
with no 8-bit capabilities!

I think it would be nice to be able to support all of these new and
older systems, and I have done an implementation of Sendmail capable
of supporting more than 60 character sets. It currently does not touch
the headers, but only the mail body. For characters not in the 
current character set, it encodes this character with a mnemonic 
code, for example a'  for the above mentioned "small a with acute".
Thus even in ASCII you can get the message!
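
The fallback idea might look roughly like this in Python. It is only a sketch and not the code in the patches: the table below is hypothetical apart from the a' example given above, and the function name is invented.

```python
# Sketch: encode for a target character set, falling back to a mnemonic
# for any character the target set lacks.

MNEMONICS = {
    "á": "a'",   # small a with acute: the example from the post
    "é": "e'",   # hypothetical entry following the same pattern
}

def encode_for(text: str, charset: str) -> bytes:
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(charset)
        except UnicodeEncodeError:
            # Not in the target set: substitute the mnemonic, or '?' if none.
            out += MNEMONICS.get(ch, "?").encode(charset)
    return bytes(out)

print(encode_for("Málaga", "latin-1"))   # target has the letter: one byte
print(encode_for("Málaga", "ascii"))     # target lacks it: mnemonic used
```

A real scheme would also need some way to tell a mnemonic from ordinary text (an a followed by a genuine apostrophe), which this sketch does not attempt.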

The sendmail patches are available with anon ftp in dkuug.dk:pub/ch.shar
and sm5.64.8+bit.pa (sm.8+bit.pa for 5.61). It's about 100 kb - the Sendmail
patches themselves are under 100 lines, the rest is the character set stuff.
It has been running here at dkuug.dk since Feb 90.

A new ISO standard is showing up: ISO 10646, which has just been
published as a Draft International Standard (DIS).
This covers all characters in the world, with very few exceptions,
and the exceptions are planned to be included in a later issue.
Dan Oscarsson and I have been planning (mostly Dan)
to do an SMTP implementation for Sendmail that negotiates 10646 for
transmission, and to write an RFC for this character set negotiation.

Keld Simonsen

jacob@gore.com (Jacob Gore) (01/02/91)

/ comp.text / keld@login.dkuug.dk (Keld J|rn Simonsen) / Jan  1, 1991 /
> Should one then just say "Use ISO 8859"? Well, what ISO 8859?
> There are several parts, latin 1, latin 2 (eastern Europe),
> Greek, Cyrillic, Arabic, Hebrew (among others). The abovementioned
> character 0162 has different meanings in these different character sets.

Specify the character set in the header.  For example,

	Char-Encoding: ISO-8859-Latin-1
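
A receiver could act on such a header along these lines. This Python sketch treats the header name and its values as hypothetical (they are a suggestion in this thread, not an existing standard), writes the label with the ISO 8859 number, and maps labels to Python codec names by a table I made up.

```python
# Sketch: pick the decoder from a charset header; fall back to ASCII.
# Header name, labels and the label-to-codec table are all assumptions.

CHARSETS = {
    "ISO-8859-Latin-1": "latin-1",
    "ISO-8859-Latin-2": "iso8859-2",
}

def decode_message(raw: bytes) -> str:
    head, _, body = raw.partition(b"\n\n")   # headers end at the first blank line
    codec = "ascii"
    for line in head.split(b"\n"):
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"char-encoding":
            label = value.strip().decode("ascii", errors="replace")
            codec = CHARSETS.get(label, "ascii")
    # Bytes that are invalid in the chosen set show up as U+FFFD.
    return body.decode(codec, errors="replace")

print(decode_message(b"Char-Encoding: ISO-8859-Latin-1\n\nB\xe5stad"))
print(decode_message(b"Subject: no charset given\n\nB\xe5stad"))
```

Without the header the receiver can only mark the byte as unknown, which is the situation described in the quoted text.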

Jacob
--
Jacob Gore		Jacob@Gore.Com			boulder!gore!jacob

les@chinet.chi.il.us (Leslie Mikesell) (01/02/91)

In article <1990Dec31.013538.9473@Think.COM> barmar@think.com (Barry Margolin) writes:
>In article <1990Dec31.004055.10335@cbnewsk.att.com> hansen@pegasus.att.com (Tony L. Hansen) writes:
>>System V release 4 mail is completely content transparent. As long as the
>>transport media is capable of handling the mail, SVr4 mail will be able to
>>get it to you unchanged.

>What does it do when sending textual mail to a system that doesn't use
>ASCII encoding, e.g. an IBM mainframe, or to a system with a different
>newline convention (e.g. CRLF rather than LF)?  SMTP places restrictions on
>the characters that may appear in a message to support automated
>translation during the transfer process.

But the automated translation can currently only work with text while
many mailers are now capable of attaching arbitrary binary data to
messages.  Depending on the type of the content, a different transformation
(or none) may be desired.  Assuming that the non-textual portions are
encapsulated with "Content-Type:" and "Content-Length:" headers, it
would be easy for the transport to determine what, if any, transformation
to use.  In addition, an optional "Encoding-Method:" header can allow
temporary transformations to meet the character set requirements of the
transports.  If the sending program had a way to determine the capabilities
of the recipient, encoding could be done on-the-fly, using uuencode or
atob, and thus only done where necessary (but I don't know of anyone
actually doing this yet...).
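
As a sketch of that on-the-fly decision, in Python: the body is uuencoded only when it contains 8-bit data and the recipient is not known to be 8-bit clean. How the sender would learn the recipient's capabilities is left open, so it appears here as a plain flag; the header value "uuencode" is my assumption.

```python
import binascii

# Sketch: encode only where necessary, and record it in a header.
# "Encoding-Method: uuencode" is a hypothetical header/value pair.

def uuencode_body(data: bytes) -> bytes:
    # uuencode works on chunks of at most 45 bytes per output line
    return b"".join(binascii.b2a_uu(data[i:i + 45])
                    for i in range(0, len(data), 45))

def prepare(body: bytes, recipient_is_8bit_clean: bool):
    needs_8bit = any(b > 0x7F for b in body)
    if needs_8bit and not recipient_is_8bit_clean:
        return {"Encoding-Method": "uuencode"}, uuencode_body(body)
    return {}, body          # send unchanged, no extra header

def restore(headers: dict, payload: bytes) -> bytes:
    if headers.get("Encoding-Method") == "uuencode":
        return b"".join(binascii.a2b_uu(line)
                        for line in payload.splitlines(keepends=True))
    return payload

body = b"B\xe5stad\n"
headers, payload = prepare(body, recipient_is_8bit_clean=False)
print(headers, payload)
print(restore(headers, payload) == body)
```

Plain 7-bit messages pass through untouched, so the cost of encoding is paid only by the mail that needs it.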

These issues are going to have to be addressed for messages originating
on X.400 systems anyway, so why not try to do it efficiently by adding
the equivalent functionality to SMTP/uucp mailers?

Les Mikesell
  les@chinet.chi.il.us

tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill) (01/03/91)

keld@login.dkuug.dk (Keld J|rn Simonsen) writes:
> 
> Should one then just say "Use ISO 8859"? Well, what ISO 8859?
> There are several parts, latin 1, latin 2 (eastern Europe),
> Greek, Cyrillic, Arabic, Hebrew (among others)...
> We should do something that could cover the whole world.

This is what Unicode is for.  Unicode should be considered the most
useful and implementable subset of the draft standard ISO 10646.

Unicode is an unambiguous fixed-length 16-bit global codeset currently
under development by the Unicode Consortium.  Unicode offers a uniform
text and character standard that can encompass all living languages and
form a long-lasting basis for worldwide data exchange.  Unicode makes
all 65,536 slots available, with these constraints:

o The first 256 slots duplicate the arrangement of ASCII and ISO
  Latin-1.

o Characters unique to a language are grouped together in standard
  order.

o Letters, punctuation, symbols, and diacritics shared by multiple
  languages are grouped together.

o Asian pictographs are grouped together in order of frequency
  (as specified by national standards), then sorted in traditional
  radical/stroke order.

o Chinese, Japanese, and Korean phonetic symbols are grouped together
  by language in standard order.

The reason 16 bits are enough is that Asian pictographs which everyone
would recognize as the same have been unified.  Thus, more than 31,000
characters have been reduced to about 20,000 slots.

		Major Han Character Standards

	Country	  Standard	Year	Characters
	-------	  --------	----	----------
	China	   GB 2312	1980	 6,763
	Japan	 JIS X0208	1983	 6,349
	Korea	  KS C5601	1987	 4,888
	Taiwan	 CNS 11643	1986	13,051
					------
		     total		31,051
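
The arithmetic in the table can be checked quickly in Python. The per-standard counts are the ones listed above, and the figure of about 20,000 unified slots is the one quoted in this post; nothing else is assumed.

```python
# Sketch: check the table total and compare against a 16-bit code space.

han_standards = {
    "GB 2312": 6763,
    "JIS X0208": 6349,
    "KS C5601": 4888,
    "CNS 11643": 13051,
}

total = sum(han_standards.values())
code_space = 2 ** 16          # number of distinct 16-bit values
unified = 20_000              # "about 20,000 slots", as stated in the post

print(total)                  # sum of the four standards
print(code_space - total)     # slots left if every standard kept its own codes
print(code_space - unified)   # slots left after unification
```

The difference between the last two figures is the room that unification leaves for the other scripts in the same 16 bits.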

In addition to East Asian languages, here are the writing systems
currently available in Unicode: Greek, Cyrillic, Georgian, Armenian,
Hebrew, Arabic, Ethiopian, Devanagari, Bengali, Gurmukhi, Gujarati,
Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhalese, Thai, Lao,
Burmese, Khmer, Tibetan, and Mongolian.

keld@login.dkuug.dk (Keld J|rn Simonsen) (01/03/91)

tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill) writes:

>keld@login.dkuug.dk (Keld J|rn Simonsen) writes:
>> 
>> Should one then just say "Use ISO 8859"? Well, what ISO 8859?
>> There are several parts, latin 1, latin 2 (eastern Europe),
>> Greek, Cyrillic, Arabic, Hebrew (among others)...
>> We should do something that could cover the whole world.

>This is what Unicode is for.  Unicode should be considered the most
>useful and implementable subset of the draft standard ISO 10646.

Is UNICODE a true subset of ISO 10646?
Is there a well defined relation between ISO 10646 encoding and UNICODE?

Season's greetings!
Keld

tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill) (01/04/91)

keld@login.dkuug.dk (Keld J|rn Simonsen) writes:
> 
> Is UNICODE a true subset of ISO 10646?
> Is there a well defined relation between ISO 10646 encoding and UNICODE?

ISO 10646 is still in draft form.  Both questions are impossible to answer
until 10646 gets finalized.  Disclaimer: I'm not an expert in this area.
However, extrapolating from what I know, it appears that Unicode could be
considered a 16-bit implementation of 10646.  The ISO 10646 draft standard
appears to permit 16-bit implementations of any subset thereof, for use in
process code or communication.  It just so happens that Unicode covers all
Asian characters enumerated by existing national standards, plus characters
from languages that the 10646 draft hasn't even thought about.  So it may
be a subset, but a largely complete subset.

Lee Collins writes:
> Notice that 10646 would require 93,816 separate codes to cover existing
> [Chinese/Japanese/Korean] standards.  Han Unification allows Unicode to
> cover the same standards with only 18,739 unique characters.

Ken Whistler writes:
> Unicode 1.0 also includes the following scripts omitted from DIS 10646:
> Ethiopian, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada,
> Malayalam, Sinhalese, and Lao.

There have been attempts to convert Unicode to 10646 and back again,
I believe with mostly good results.  Of course, some data may be lost
in the translation.