npn@cbnewsl.att.com (nils-peter.nelson) (12/18/90)
We have gotten several requests for support of international
character sets, and for "8 bit clean" troff. We need some
help figuring out what is needed. Ignoring for the moment
Kanji as a separate case, this inquiry is directed to the
European market.
If we select a specific character, say the Swedish character
that looks like an a with a circle over it, as in the second
character of my ancestral home town, Bastad, then the following
implementations are possible:
1. Using the current DWB troff longname convention, you can
address any PostScript character, viz. "B\N'aring'stad"
(aring is the Adobe PostScript name for this character, which,
for the non-Swedes, is pronounced "oh").
2. We could invent a new troff escape, viz. "B\(aostad".
We don't want to invent one if there is an established convention.
3. There are apparently 8 bit internal representations that
are created by double striking keys (perhaps ESCAPE A, but I
don't know). I need to know if there is a practical standard
for such (i.e., one people actually use).
4. I recall some countries (e.g., Scandinavian) use ordinary 7 bit
punctuation chars in lieu of these letters, viz. "B{stad"
or somesuch.
I don't want to hear about every pending international standard.
I want to hear what people who will be using the software would
like to see.
If you have answers to the above, please send them to npn@mhuxo.att.com.
I am looking for data, not opinions, so replies from Europe
will be heavily weighted over others. I am not trying to
create another "gaol" vs. "jail" controversy.
yfcw14@castle.ed.ac.uk (K P Donnelly) (12/19/90)
I have come in on this discussion from the outside, but it sounds as if
you have misunderstood the requests.
It sounds to me as if what people are asking for is for troff to stop
stripping the eighth bit off characters in the input file, but instead
to pass them to the output file just like (7-bit) ASCII characters.
There is no need to invent new (7-bit) ASCII representations of non
ASCII characters, such as \ao for a-ring. Such representations may
be desirable for other purposes, but that is a separate issue.
Anyone with a VT220 or VT320 terminal can input a-ring using the
three character sequence
<Compose-character> a *
The character gets hexadecimal code E5, in compliance with the ISO
standard for western European languages. You see it on the screen and
edit it just like any other character using any editor (such as the
version 3.10 of microEmacs) which doesn't strip the eighth bit.
However, it gets very frustrating if the software which gets between you
and the laser printer insists for no good reason on stripping the eighth
bit and turning the a-ring into an 'e' (ASCII code 65 hex).
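As an illustration of the stripping described here, a minimal Python sketch follows. The function name `strip_eighth_bit` is invented for this example; it is not code from troff or from any of the packages mentioned, only a model of what a 7-bit-only filter does to Latin-1 input.

```python
def strip_eighth_bit(data: bytes) -> bytes:
    """Clear the top bit of every byte, as a 7-bit-only filter would."""
    return bytes(b & 0x7F for b in data)


# "Bastad" with a-ring (U+00E5) as its second letter, encoded in ISO 8859-1
# (Latin-1), where a-ring is the single byte 0xE5.
original = "B\u00e5stad".encode("latin-1")
stripped = strip_eighth_bit(original)

print(original)   # b'B\xe5stad'
print(stripped)   # b'Bestad'  (0xE5 & 0x7F == 0x65, ASCII 'e')
```

Plain ASCII passes through such a filter unchanged, which is presumably why the damage goes unnoticed on English-only text.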
There is lots of such software around, especially on Unix. I think it
is something to do with the eighth bit having been used for parity check
in the past, so the software thought it was safest to filter it out.
Nowadays, with cleaner communications lines, the eighth bit is hardly
ever used for parity check - it wasn't a very good system anyway - and
software packages which in the past have stripped the eighth bit are one
by one changing their policies - witness Kermit 3.0, TeX 3.0, microEmacs
3.10.
The Scandinavians have up til now used "national versions" of ASCII in
which characters like { } ~ | are replaced by national characters like
a-ring. This often makes their names look weird in mail signatures, and
must cause them a lot of trouble when programming in languages like C
and Pascal. The Germans use in computing the alternative system of
placing an 'e' after the vowel instead of an umlaut sign above it. The
French have such a variety of accented characters that in computing
(mail messages and so on) they usually give up and leave out the
accents. Sometimes they use devices like putting an apostrophe before or
after the vowel to indicate an acute accent. On the Gaelic language
bulletin board in which I participate, we always use a slash, '/', after
the vowel to indicate an acute accent. The Icelanders have many more
non-ASCII characters in their language than other Scandinavians,
including some unique ones such as 'eth' and 'thorn' where you can't
just "leave out the accent", so they have for long been ahead of the
world in using 8-bit text in their computing.
Kevin Donnelly
rcd@ico.isc.com (Dick Dunn) (12/20/90)
yfcw14@castle.ed.ac.uk (K P Donnelly) writes:
> It sounds to me as if what people are asking for is for troff to stop
> stripping the eighth bit off characters in the input file, but instead
> to pass them to the output file just like (7-bit) ASCII characters.
It's not at all that simple. Troff has to know about the characters--it
needs to be able to find them in its width tables and know whether the
characters have ascenders and/or descenders (for sb/st/ct number regs).
There's also an issue of whether troff should produce 8-bit codes on
its output--there are some good arguments that it should not. The matter
of 7-bit data paths is rather more complicated (and clumsy) than the single
issue of a parity bit that Donnelly mentions. There are some methods of
data interchange, such as most email systems, that are inherently 7-bit.
It would be nice if we could just banish them, but compatibility is an
albatross.
The issue of inventing alternate representations, such as \(ao for "a ring"
goes beyond the issue of simple 8-bit transparency. There are many more
characters needed than can be represented in an 8-bit code set. Certainly
one wants a conventional 8-bit set (such as Latin 1) for convenience, but
more characters are needed even for European usage. It is useful to have
a canonical representation in terms of 7-bit codes even if it's not the
most commonly used.
> The Scandinavians have up til now used "national versions" of ASCII in
> which characters like { } ~ | are replaced by national characters like
> a-ring...
These are not ASCII. They are national versions of ISO 646. If you like,
you could think of ASCII as a "national version" of ISO 646 used in the
USA. 646 provides a few codes which are reserved for national characters;
ASCII provides a particular assignment to those codes. The Scandinavian
conventions are simply different assignments.
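To make the "different assignments" point concrete, here is a small Python sketch that reads the same 7-bit codes under two ISO 646 variants. The Danish (DS 2089) assignments are taken from the table in the Keizer/Simonsen/Akkerhuis article posted later in this thread; the names `ISO646_DK`, `ISO646_US` and `decode_iso646` are made up for the illustration.

```python
# National-use positions and their Danish (DS 2089) assignments, as listed
# in the ISO 646 table of the article posted later in this thread.
ISO646_DK = {
    "[": "\u00c6",   # capital AE
    "\\": "\u00d8",  # capital O slash
    "]": "\u00c5",   # capital A ring
    "{": "\u00e6",   # small ae
    "|": "\u00f8",   # small o slash
    "}": "\u00e5",   # small a ring
}

# US-ASCII keeps the familiar punctuation in those positions.
ISO646_US = {}


def decode_iso646(text: str, national: dict) -> str:
    """Interpret 7-bit text according to one national assignment table."""
    return text.translate(str.maketrans(national))


line = "S} skriver vi noget s|dt p} dansk"
print(decode_iso646(line, ISO646_US))  # unchanged: braces and bars
print(decode_iso646(line, ISO646_DK))  # the Danish letters appear
```

The bytes never change; only the table used to interpret them does, which is why a Danish name looks odd on a US-ASCII terminal.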
> ...The Germans use in computing the alternative system of
> placing an 'e' after the vowel instead of an umlaut sign above it...
This alternative representation far predates computer usage, although it
is certainly a convenient solution. Note also that scharfes ess turns
into "ss".
--
Dick Dunn rcd@ico.isc.com -or- ico!rcd Boulder, CO (303)449-2870
...Mr. Natural says, "Use the right tool for the job."
clewis@ecicrl.UUCP (Chris Lewis) (12/21/90)
In article <1990Dec20.012516.23623@ico.isc.com> rcd@ico.isc.com (Dick Dunn) writes:
>yfcw14@castle.ed.ac.uk (K P Donnelly) writes:
>> It sounds to me as if what people are asking for is for troff to stop
>> stripping the eighth bit off characters in the input file, but instead
>> to pass them to the output file just like (7-bit) ASCII characters.
>It's not at all that simple. Troff has to know about the characters--it
>needs to be able to find them in its width tables and know whether the
>characters have ascenders and/or descenders (for sb/st/ct number regs).
Donnelly is right though, that's primarily what people *do* want.
(Note that 8-bit clean in ditroff doesn't imply that ditroff is passing
the 8-bit characters directly to the printer - on the contrary, it only
means that 8-bit characters can appear in the ditroff *input* file,
and 8-bit characters can be found (somehow) in the width tables.
There's no particular reason that the ditroff format output file (troff(5))
actually contains 8-bit characters, for this file isn't really intended to
be read, and extensions to the "c<char>" directive could be altered to
permit octal or some other 7-bit clean representation if necessary.)
[Refresher on ditroff guts:
    troff document -> ditroff -> ditroff intermediate -> filter -> printer
                         ^                                  ^
                         |                                  |
                         +-------------+--------------------+
                                       |
                                 width tables
The width tables contain the width and kerning information that ditroff
needs to know for character placement, and also contain the byte that
the filter emits for each character (though, the filter doesn't have to
use them). The ditroff intermediate is a displayable file with simple
commands indicating character placement, font size, points etc.]
It shouldn't be all that much harder for ditroff to permit 8-bit characters
in the width tables. After all, it does permit octal sequences in the
fourth field (the character the backend is to emit to generate the desired
glyph).
It would be nicest if the left most column (the character ditroff is
searching for) could be 8-bit, but octal would probably serve in a pinch,
permitting both would be even better (and would permit transmission of
these files over 7-bit paths/editing via 7-bit vi's etc.) Psroff's
analogous tables do permit this.
T'would be especially nice now that the newest vi's are now 8-bit clean.
And emacs is now as well.
>There's also an issue of whether troff should produce 8-bit codes on
>its output--there are some good arguments that it should not. The matter
>of 7-bit data paths is rather more complicated (and clumsy) than the single
>issue of a parity bit that Donnelly mentions. There are some methods of
>data interchange, such as most email systems, that are inherently 7-bit.
>It would be nice if we could just banish them, but compatibility is an
>albatross.
Since you're talking about troff generating 8-bit codes on its output,
I'm not sure that this is a real issue, because the intermediate ditroff
format output isn't really an interchange format. Regardless, what people
want is the ability to jam 8-bit characters into the input of ditroff
and have it do sane things, not necessarily their representation in the
intermediate file.
As it is now, the width tables *do* permit the filter to emit 8-bit
characters to the printer - they *have* to. On that note, you might
have trouble getting 8-bit characters to the printer, but that's the
OS's/system administrator's/printer designer's fault.
On the other hand, permitting ditroff to accept 8-bit characters on
*input* may get people into trouble when they try to mail something
through a 7-bit path. But it isn't all that difficult to solve, either
by uuencoding (or similar) or having a program that converts the 8-bit
characters to the \(xx convention (and vice-versa) (I'm in fact going
to be implementing something like this in Psroff). People solve it
all the time when shipping PC binaries.
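A converter of the kind mentioned above could look roughly like the following Python sketch. It is not Psroff's implementation: the function names are invented here, and the table is only a small subset of the two-character names proposed in the Keizer/Simonsen/Akkerhuis article posted later in this thread.

```python
import re

# Latin-1 letters and their proposed two-character troff names
# (a subset of the list in the article posted later in this thread).
LATIN1_TO_NAME = {
    "\u00e5": "oa", "\u00c5": "oA",                  # a ring, A ring
    "\u00e4": ":a", "\u00f6": ":o", "\u00fc": ":u",  # diaeresis
    "\u00e6": "ae", "\u00c6": "AE",                  # ae, AE
    "\u00f8": "/o", "\u00d8": "/O",                  # o slash, O slash
    "\u00e9": "'e", "\u00e7": ",c", "\u00df": "ss",  # e acute, c cedilla, sharp s
}
NAME_TO_LATIN1 = {name: ch for ch, name in LATIN1_TO_NAME.items()}


def to_seven_bit(text: str) -> str:
    """Replace each known 8-bit letter by its \\(xx escape."""
    return "".join(
        "\\(" + LATIN1_TO_NAME[ch] if ch in LATIN1_TO_NAME else ch
        for ch in text
    )


def to_eight_bit(text: str) -> str:
    """Replace each known \\(xx escape by its 8-bit letter; leave others alone."""
    def repl(match):
        return NAME_TO_LATIN1.get(match.group(1), match.group(0))
    return re.sub(r"\\\((..)", repl, text)


print(to_seven_bit("B\u00e5stad"))   # B\(oastad
print(to_eight_bit("B\\(oastad"))
```

The sketch does not try to recognise an escaped backslash in front of a "(" and knows nothing about characters outside its table, so a real filter would need a fuller table and a proper troff-aware scanner.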
Requiring all of Europe to have to type those silly 4-character sequences
when trying to edit documents in their own language when 8-bit is *easy*
isn't a very nice thing to do to faithful customers.
And it isn't *just* Europe. It's Canada too. (I'm a member of the
CSA/Treasury Board Canadian Posix Working Group). Canada is also trying to
encourage Latin-1 because of bilingual (English/French) requirements both in
government and in the private sector. 7-bit ASCII is very nearly *only*
the USA (most other English speaking countries are either tending towards
Latin-1, or a different version of 646. 646 in all its national variants
only satisfies completely a minority of the Roman-alphabet countries).
Lest one think that Canada is a minor addition to Western Europe in this
context, one should remember that Canada is the US's biggest single trading
partner by a substantial margin (considerably larger than Japan). The only
market bigger than Canada is the EEC taken as a unit (aka Western Europe,
aka the other Latin-1 countries). Markets, markets!
Heck, if you're ever even thinking about Kanji, you really should satisfy
the considerably larger group of customers that need Latin-1.
>The issue of inventing alternate representations, such as \(ao for "a ring"
>goes beyond the issue of simple 8-bit transparency.
It *isn't* transparency, on the contrary. HOWEVER, having both the 8-bit
input transparency as well as alternate representations that need only
7-bit would be a definite plus, so that people with 646 compliant terminals
can do the same things that the newer Latin-1 ones can.
And it ain't all that hard to do. Hell, if I can do it in psroff via CAT
troff without source, AT&T should be able to do it with ditroff.
--
Chris Lewis, Phone: (613) 832-0541 UUCP: uunet!utai!lsuc!ecicrl!clewis
Moderator of the Ferret Mailing List (ferret-request@eci386)
Psroff mailing list (psroff-request@eci386)
heimir@rhi.hi.is (Heimir Thor Sverrisson) (12/24/90)
In <1990Dec20.012516.23623@ico.isc.com> rcd@ico.isc.com (Dick Dunn) writes:
> There are some methods of
>data interchange, such as most email systems, that are inherently 7-bit.
>It would be nice if we could just banish them, but compatibility is an
>albatross.
There is nothing that tells me that email systems should be _inherently
7-bit_. In fact here in Iceland we have to hack almost every piece of
communications software to be able to use it in our _inherently 8-bit
environment_.
Thus I can see no reason at all for some stupid mailers to strip off the
eighth bit (including Interactive's version of sendmail). They don't have
to - and should not - interpret the contents of the mail they are
transmitting. This is quite different from the troff situation where a
program has to know a lot about its input set.
So why don't you guys simply open up your mailers and be ready with an
8-bit clean version by the end of 1991!
--
Heimir Thor Sverrisson heimir@rhi.hi.is
keld@login.dkuug.dk (Keld J|rn Simonsen) (12/26/90)
I was the coauthor of an article on 7-bit names for troff of
non-ASCII characters, together with some support for national
ISO 646 variants. Here is the article. Enjoy!
: This is a shar archive. Extract with sh, not csh.
: This archive ends with exit, so do not worry about trailing junk.
: --------------------------- cut here --------------------------
PATH=/bin:/usr/bin:/usr/ucb
echo Extracting 'Makefile'
sed 's/^X//' > 'Makefile' << '+ END-OF-FILE ''Makefile'
Xextchar.dit: extchar specchar
X refer -e -p typesetting extchar | tbl | troff $(DEVICE) -ms >extchar.dit
X
Xspecchar: spec nicetr
X sh nicetr spec >specchar
X
Xprint: extchar.dit
X dip $(DEVICE) extchar.dit
X
Xallchar: spec nicetr spec.p400
X sh nicetr spec spec.p400 >allchar
X
Xdistr:
X shar Makefile extchar spec nicetr tmac.la pchdefs typesetting >distr
+ END-OF-FILE Makefile
chmod 'u=r,g=r,o=r' 'Makefile'
set `wc -c 'Makefile'`
count=$1
case $count in
351) :;;
*) echo 'Bad character count in ''Makefile' >&2
echo 'Count should be 351' >&2
esac
echo Extracting 'extchar'
sed 's/^X//' > 'extchar' << '+ END-OF-FILE ''extchar'
X.ds LF DRAFT
X.so pchdefs
X.so tmac.la
X.la US
X.TL
XAn extension to the troff character set
X.AU
XE.G. Keizer
XK.J. Simonsen
XJ. Akkerhuis
X.AI
XVrije Universiteit, Amsterdam, The Netherlands
XUniversity of Copenhagen, Copenhagen, Denmark
XC.M.U., USA
X.AB
XThe typesetting program
X.I troff
Xwas originally written for formatting English text for the CAT 48 Typesetter.
XIts offspring is used for formatting a variety of languages with a large
Xdiversity of output devices.
XThe authors agreed on an addition
Xto the troff character set covering old and new national and
Xinternational latin based character sets.
X.AE
X.NH
XThe problems
X.PP
XWhen adapting the
X.UX
Xtypesetting program
X.I troff\|
X.[
XOssanna
X.]
Xto a new output device, one wants to have access
Xto the extra characters offered by the device,
Xwithout sacrificing any characters already in use.
XDevice independent \f2troff\fP, also called
X.I titroff
Xor
X.I ditroff\|
X.[
XTypesetter independent troff Kernighan
X.]
Xuses a flexible font definition mechanism that allows addition and deletion
Xof characters.
X.LP
XMany people, including the authors, have used this mechanism
X.[
XKahrs and Moore
X.]
Xto add characters.
XThis has led to a diversity of names, with the expected conflicts
Xof using the same name for different characters and different names for the same
Xcharacter in different implementations.
XThus
X.I troff
Xinput files are becoming less and less portable,
Xeven for the same output device on different installations.
XWe regret this development and, during a conference in Copenhagen,
Xwe decided to make an attempt at some standardization.
XB.W. Kernighan, author of ditroff, agreed to our proposal of acting as a
Xclearing house for our new names and he still has to give his blessing
Xto this article.
X.PP
XWe realize that it is impossible to name every printable character in the
Xworld.
XThe total amount of different characters is simply too huge.
XNaming all the hundred different turtles in a turtle font is both
Xfrustrating and futile.
XWe restricted ourselves to the following categories:
X.IP \(bu
Xcharacters belonging to the printed language of several Western-European
Xcountries: \(AE \(ss
X.IP \(bu
Xvariations of letters, especially with accents: \(:i \(^u \(:o
X.IP \(bu
Xoften used mathematical symbols: \(AN \(OR \(c*
X.NH 2
XNatural language support
X.PP
XThe
X.I troff
Xcharacter set is based on the US-ASCII standard.
XThis standard is well suited to English text,
Xbut causes problems when used in most European countries.
XUS-ASCII is a version of the ISO 646-1983 standard.
XThe ISO standard states the characters used in 7 bit ASCII
Xand contains 94 printable graphic symbols.
XISO-646 allows national versions for 12 of its character positions.
XThe European Computer Manufacturers' Association ECMA is registering
Xall different national versions of ISO 646-1983 and assigns
Xa different character to each.
XThis allows the creation of documents with multiple character sets.
XThe assigned character serves to identify each character set in such documents.
XThe table below shows several versions conforming to ISO 646.
X.TS
Xcenter,allbox;
Xc s s s s s s s s s s s s s s
Xl l l s s s s s s s s s s s s
Xl l l c c c c c c c c c c c c.
XNational ISO 646 character sets
XCountry Standard .la parameter
XISO ISO 646 IRV ISO \(sh \(Cs \(at \(lB \(rs \(rB \(ha \(oq \(lC \(ba \(rC \(rn
XUSA X3.4-1968 US \(sh \(Do \(at \(lB \(rs \(rB \(ha \(oq \(lC \(ba \(rC \(ti
XGreat Britain BS 4730 GB \(Po \(Do \(at \(lB \(rs \(rB \(ha \(oq \(lC \(ba \(rC \(rn
XJapan JIS C 6229 JP \(sh \(Do \(at \(lB \(Ye \(rB \(ha \(oq \(lC \(ba \(rC \(rn
XChina GB 1988-80 CN \(sh \(Ye \(at \(lB \(rs \(rB \(ha \(oq \(lC \(ba \(rC \(rn
XDenmark DS 2089 DK \(sh \(Do \(at \(AE \(/O \(oA \(ha \(oq \(ae \(/o \(oa \(ti
XNorway NS 4551-1 NO \(sh \(Do \(at \(AE \(/O \(oA \(ha \(oq \(ae \(/o \(oa \(rn
X.la NO2
X NS 4551-2 NO2 # $ @ [ \(/O ] ^ ` { | } ~
X.la FI
XFinland FI # $ @ [ \e ] ^ ` { | } ~
XSweden SEN 850200 B SE \(sh \(Cs \(at \(:A \(:O \(oA \(:U \(oq \(:a \(:o \(oa \(rn
X SEN 850200 C SE2 \(sh \(Cs \*('E \(:A \(:O \(oA \(:U \('e \(:a \(:o \(oa \(:u
XGermany DIN 66 003 DE \(sh \(Do \(sc \(:A \(:O \(:U \(ha \(oq \(:a \(:o \(:u \(ss
XHungary MSZ 7795/3 HU \(sh \(Cs \*('A \*('E \(:O \(:U \(ha \('a \('e \(:o \(:u \(a"
X.\" Jugoslavia JUS I.B1.002 JS \(sh \(Do \*(vZ \*(vS \*(-D \*('C \*(vC \*(vz \*(vs \*(-d \*(vc \*('c
X.la FR
XFrance NF Z 62-010 FR \(Po \(Do \(`a \(de \(,c \(sc \(ha \(my \('e \(`u \(`e \(ad
XItaly IT \(Po \(Do \(sc \(de \(,c \('e \(ha \(`u \(`a \(`o \(`e \(`i
X.la ES
XSpain ES # $ @ [ \e ] ^ ` { | } ~
X.la ES2
X ES2 # $ @ [ \e ] ^ ` { | } ~
XPortugal PT \(sh \(Do \(sc \*(~A \(,C \*(~O \(ha \(oq \(~a \(,c \(~o \(de
X PT2 \(sh \(Do \(aa \*(~A \(,C \*(~O \(ha \(oq \(~a \(,c \(~o \(ti
X.TE
X.la US
X.vs
X.ps
X.PP
XTerminals in these countries are often
Xadapted to these national variations.
XCreating
X.I troff
Xinput for Danish texts on Danish terminals is a frustrating experience.
XOne has to type
X.B \e(AE
Xfor
X.B \(AE
Xin spite of the presence of a special key for \(AE.
XYou can by-pass this by using the
X.I troff
Xcommand \f3.tr\fP,
Xwhich allows the mapping of any character to any other.
XWe have employed the \f3.tr\fP command in a macro \f3.la\fP
Xwhich is designed to make it possible to shift between
Xall the ISO 646 input character sets in the above table.
XThe macro takes a code for the country as parameter; the first
Xtwo letters being the ISO 3166 two-letter country code.
XThe \f3.la\fP macro can be used like this:
X
X.ft CW
X.nf
X .la US
X First we write something in "God's own" character set.
X .la DK
X S} skriver vi noget s|dt p} dansk: sodavandsis.
X .la DE
X Und f}r Deutschen k|nnen wir auch etwas schreiben!
X .la US
X.fi
X
X.ft
Xgiving:
X.br
X.ft
XFirst we write something in "God's own" character set.
X.la DK
XS} skriver vi noget s|dt p} dansk: sodavandsis.
X.la DE
XUnd f}r Deutschen k|nnen wir auch etwas schreiben!
X.la US
X.ft
X
XThe \f3.la\fP
Xmacro can only be used when all the characters of the character sets in use
Xhave a unique code on the printing device. Also you cannot change input
Xcharacter set within a diversion in
X.I troff,
Xyou need to use the special character names if you want to use foreign
Xcharacters within a sentence. An example of having a French name
Xin a Danish text:
X
X.ft CW
X.nf
XJeg s} Jer\e(^ome Fran\e(,cois komme til K|benhavn.
X.ft
Xgiving:
X.ft
X.la DK
XJeg s} Jer\(^ome Fran\(,cois komme til K|benhavn.
X.ft
X.fi
X.la US
X.PP
XTo be able to use all other national characters within a national
Xcharacter set,
Xwe decided to introduce names for all the different national
Xcharacters,
Xeven for the 'default' US names.
X.PP
XAlso we went through the new ISO standards for Latin alphabets
X(ISO 8859)
Xand assured that all special characters there would have a unique
Xname according to this proposal.
X.PP
XA last warning is about the
X.I troff
Xescape character \e and the national characters taking its place.
XHere you must write the national character followed by an 'e'
Xto get the desired result.
X
XThe \f3.la\fP macro has the following contents:
X.DS
X.ft CW
X.ps 6
X.nr Sw \w' '
X.ta 8u*\n(Swu 16u*\n(Swu 24u*\n(Swu 32u*\n(Swu 40u*\n(Swu 48u*\n(Swu 56u*\n(Swu
X.eo
X.nf
X.cc &
X&so tmac.la
X&cc .
X.fi
X.ec
X.ps
X.DE
X.ft 1
X.NH 2
XProblems we did not pursue
X.PP
XWritten English is based on the latin alphabet,
Xit hardly uses accents and other variations of letters.
XOther languages make use of accents above (\(`e), below (\(,c) and
Xthrough (\(/o) the letters.
XWe do not address the problems of languages with more than one
Xaccent per letter and
X. \"(\o'v\(aa'\h'-\w'v'u'\v'0.1m',.\v'-0.1m'\h'0.2m'),
Xaccents connecting letters.
XIn our opinion these problems have to be solved by separate preprocessors.
XThis allows a much more friendly user interface.
XThese preprocessors could also solve the problem of hyphenation
Xand ligatures for these languages.
X.NH 2
XThe troff naming scheme
X.PP
X.I Troff
Xhas three ways of naming characters:
X.IP \(bu
Xone character ASCII names like
X.B A
Xfor A,
X.B B
Xfor B
Xand
X.B @
Xfor @.
X.IP \(bu
Xescaped one character names prefixed by a
X.B \e
Xlike
X.B \e\-
Xfor current font minus,
X.B \ee
Xfor backslash
Xand
X.B \e\'
Xfor acute accent.
XThere are only a few of these.
X.IP \(bu
Xtwo character names prefixed by the indicator
X.B \e(
Xlike
X.B \e(sc
Xfor \(sc,
X.B \e(*g
Xfor \(*g
Xand
X.B \e(14
Xfor \(14.
X.LP
XThe sets of one-character names and escaped one character names are fixed.
XOnly the set of two character names can be extended.
X.NH
XChoosing new names
X.PP
XWhile choosing names for new characters we were very much aware
Xof the fact that the restriction of two characters per name
Xdefies all attempts to choose a consistent and logical naming scheme.
XStill we used a few principles in choosing the new names.
XWhenever these principles conflicted we refrained from long
Xdiscussions but placed more value on a quick decision.
X.LP
XOur principles:
X.IP \(bu
Xmay not conflict with original troff manual
X.[
XOssanna
X.]
X.IP \(bu
Xtry to avoid the national characters:
X\(sh \(Do \(at \(lB \(rs \(rB \(ha \(oq \(lC \(ba \(rC \(ti
X.IP \(bu
Xuse characters associated with the graphical description of the symbol.
XThus
X.B \e(oA
Xfor
X.B \(oA
Xinstead of \f3\e(AA\f1.
X.IP \(bu
XWhenever a character is a combination of an accent and a letter
Xuse as name a character representing the accent followed by the letter.
XFor example: \f3\e('e\f1 for \('e.
XThis is the way these combinations were made on old-fashioned
Xtypewriters:
Xfirst hit the dead key with the accent then the key with the letter.
XThe characters we used for the accents (and the like) are:
X.br
X.sp 0.6
X.TS
Xallbox,center;
Xl c c c c c c c c c c c c c.
XASCII \(aa \(ga : , \(ti o \(ha . " u v - /
XAccent \(aa \(ga \(ad \(ac \(ti \(de \(ha . \(a" \(ab \(ah \(hy \(sl
XName aa ga ad ac ti de ha a. a" ab ah hy sl
X.TE
X.IP \(bu
Xfractions are done in the natural way: \f3\e(\f2nm\f1
Xfor \s-3\v'-0.45m'n\v'0.45m'\s0/\s-3m\s0.
X.IP \(bu
XCurrency signs have two letter names.
XThe first letter is Capital, the second small.
X.IP \(bu
XThe names for accents start with an \f3a\fP, for example
X.B \e(ad
Xfor \(ad.
X.NH
XThe weird characters one has
X.LP
XSuppose you have a
X.I flat,
Xwhich is unlikely on any other device, but
Xstill want to have your input send to your friend/editor which can run
Xit on his troff.
XThe first thing to do is not using the character as it is
Ximplemented on your troff (f.i. \e(ft), but define a
Xstring at the beginning of the text and use that.
XThe next thing to do is to give a description how it should look like.
XSo if we stick with our flat, you end up at the start of the your
Xinput file with:
X.DS
X.ft CW
X .\e" We use a "flat" (\e(ft) a lot instead of a backslash, because it
X .\e" stands out nicely and, since a backslash can be interpreted in
X .\e" so many ways, I want to make clear that you see an escape when I
X .\e" talk about it....
X .\e" So I define the string ft
X .ds ft \e(ft
X .\e" A "flat" is what in music stands for lowering the current
X .\e" tone. If that not is clear, consider a | (pipe character) with
X .\e" an small circle attached to it on the left bottom side.
X .\e" A define in the style of
X .\e" .ds \eo'|o'
X .\e" will do if you don't have anything more than a lineprinter around.
X .\e" If you want to be really fancy, you might want to try
X .\e" something like:
X .\e" .nr x \ew'o'/2u
X .\e" .nr y .2m
X .\e" .ds ft \ev'-\enyu'\ez|\eh'\enxu'\ev'\enyu'\eS'-9'o\eS'0'
X .\e" Fancy ain't it?
X.DE
X.ft 1
XSo the rest of your article will look (at the input side) as:
X.DS
X.ft CW
X An escape (denoted by \e*(ft) in troff will introduce a two
X character name XX by \e*(ft( (so \e*(ft(XX) and a two character
X named QQ string interpolation will be triggered
X by \e*(ft*( (\e*(ft*(QQ).
X.DE
X.NH
XThe future
X.PP
XNew versions of
X.I troff
Xinclude more forms of naming characters.
XBoth \f3\eC\'\f2arbitrary\ long\ name\f3\'\f1 and
X\f3\eN\'\f2absolute\ Number\ of\ character\ in\ current\ font\f3\'\f1
Xare used.
XThe first allows character names of arbitrary length.
XThe latter allows unnamed characters.
XThis does not solve the problems discussed in this article.
XWorse, it will even be harder to choose names upon which
Xa sizeable number of people will agree.
X.NH
XBibliography
X.LP
X.[
X$LIST$
X.]
X.bp
X.NH
XThe character set
X.LP
X.so specchar
+ END-OF-FILE extchar
chmod 'u=r,g=r,o=r' 'extchar'
set `wc -c 'extchar'`
count=$1
case $count in
12207) :;;
*) echo 'Bad character count in ''extchar' >&2
echo 'Count should be 12207' >&2
esac
echo Extracting 'spec'
sed 's/^X//' > 'spec' << '+ END-OF-FILE ''spec'
XWe divided the new characters in four categories.
XEach character is mentioned only once.
XWhenever we doubted we tried to place a character in the category we thought was
Xmost suitable. Lastly we included the characters from Ossanna's
Xtroff document for reference.
X.NH 2
XSymbols from ISO 646 standards
X.LP
X.TS
Xsh sharp
XYe Yen
XCs Currency sign
XDo Dollar
XPo English pound
Xat at sign
XlB left square bracket
Xrs backslash
XrB right square bracket
Xha hat or accent circumflex
XlC left curly bracket
XrC right curly bracket
Xba bar (possibly broken)
Xti tilde
Xa" accent double quote
XAE AE
X/O O slash
XoA A circle
Xae ae
X/o o slash
Xoa a circle
X:A A diaeresis
X:O O diaeresis
X:U U diaeresis
X'e e acute accent
X:a a diaeresis
X:o o diaeresis
X:u u diaeresis
Xss German ringel S
X`a a grave accent
X,c c cedilla
X`e e grave accent
X`u u grave accent
X`o o grave accent
X`i i grave accent
Xr! reverse !
X~N N tilde
Xr? reverse ?
X~n n tilde
X~a a tilde
X^a a hat
X,C C cedilla
X^e e hat
X^i i hat
X~o o tilde
X^u u hat
X'E \*('E E acute accent
X'A \*('A A acute accent
X'a a acute accent
X'i i acute accent
X'c \*('c c acute accent
X'C \*('C C acute accent
X~O \*(~O O tilde
X~A \*(~A A tilde
X-d d bar
X-D Capital Icelandic Eth (D) / D bar
X.TE
X.ne 10
X.NH 2
XSymbols from ISO 8859 standards
X.LP
X.TS
XSd small Icelandic eth (d)
X/l Polish l
X/L Polish L
XTp Small Icelandic Thorn
XTP Capital Icelandic Thorn
Xbb broken bar
XS1 \h'0.9n'\v'-1n'\s-3\&1\s0\v'1n' superscript 1
XS2 \h'0.9n'\v'-1n'\s-3\&2\s0\v'1n' superscript 2
XS3 \h'0.9n'\v'-1n'\s-3\&3\s0\v'1n' superscript 3
X:e e diaeresis
X`E E grave accent
X'o o acute accent
X^o o hat
X'u u acute accent
XOf feminin ordinator indicator
XOm masculin ordinator indicator
XFo double french open quote
XFc double french close quote
Xac accent cedilla
Xad accent diaeresis
Xps english paragraph sign
X:i i diaeresis
X12 one half
X14 one quart
X34 three quart
Xmd centered dot
Xno not
X.TE
X.ne 10
X.NH 2
XTypographical characters
X.LP
X.TS
Xab accent breve
Xao accent corona
Xah accent ha\o'c\(ah'ek
Xa. accent dot
Xho hook
X.i dotless i
X.j dotless j
Xfo french open quote
Xfc french close quote
XIJ IJ ligature IJ
Xij ligature ij
Xtm Trade Mark
Xoq open quote
Xoe French oe
XOE French OE
XOK check mark
X.TE
X.ne 20
X.NH 2
XMathematical characters
X.LP
X.TS
X%0 per mille
X-h h bar
Xsd second sign
Xc+ circle plus
Xc* circle times
X>~ approximately greater
X<~ approximately less
X<< much less
X>> much greater
X=~ approximately equal
XOR logical or
XAN logical and
Xfa for all
Xte there exists
X3d therefore
Xpp perpendicular to
X/_ angle
X!< not less
X!> not greater
Xnm not a member
Xcn contains
Xnc does not contain
X~~ approximately
XAh Aleph
Xne not equivalent
X-+ minus plus
X.TE
X.NH 2
XSymbols from the Troff manual by Ossanna.
X.LP
X.TS
Xem 3/4 Em dash
Xhy hyphen
Xbu bullet
Xsq square
Xru rule
X14 1/4
X12 1/2
X34 3/4
Xfi fi
Xfl fl
XFi ffi
Xff ff
XFl ffl
Xde degree
Xdg dagger
Xfm foot mark
Xct cent sign
Xrg registered
Xco copyright
Xpl math plus
Xmi math minus
Xeq math equals
X** math star
Xsc section
Xaa acute accent
Xga grave accent
Xul underline
Xsl slash (matching backslash)
X*a alpha
X*b beta
X*g gamma
X*d delta
X*e epsilon
X*z zeta
X*y eta
X*h theta
X*i iota
X*k kappa
X*l lambda
X*m mu
X*n nu
X*c xi
X*o omicron
X*p pi
X*r rho
X*s sigma
Xts terminal sigma
X*t tau
X*u upsilon
X*f phi
X*x chi
X*q psi
X*w omega
X*A Alpha
X*B Beta
X*G Gamma
X*D Delta
X*E Epsilon
X*Z Zeta
X*Y Eta
X*H Theta
X*I Iota
X*K Kappa
X*L Lambda
X*M Mu
X*N Nu
X*C Xi
X*O Omicron
X*P Pi
X*R Rho
X*S Sigma
X*T Tau
X*U Upsilon
X*F Phi
X*X Chi
X*Q Psi
X*W Omega
Xsr square root
Xrn root en extender
X>= >=
X<= <=
X== identical equal
X~= approx =
Xap approximates
X!= not equal
X-> right arrow
X<- left arrow
Xua up arrow
Xda down arrow
Xmu multiply
Xdi divide
X+- plus-minus
Xcu cup (union)
Xca cap (intersection)
Xsb subset of
Xsp superset of
Xib improper subset
Xip improper superset
Xif infinity
Xpd partial derivative
Xgr gradient
Xnp not
Xis integral sign
Xpt proportional to
Xes empty set
Xmo member of
Xbr box vertical rule
Xdd double dagger
Xrh right hand
Xlh left hand
Xbs Bell System logo
Xor or
Xci circle
Xlt left top of big curly bracket
Xlb left bottom
Xrt right top
Xrb right bottom
Xlk left center
Xrk right center
Xbv bold vertical
Xlf left floor
Xrf right floor
Xlc left ceiling
Xrc right ceiling
X.TE
+ END-OF-FILE spec
chmod 'u=r,g=r,o=r' 'spec'
set `wc -c 'spec'`
count=$1
case $count in
4339) :;;
*) echo 'Bad character count in ''spec' >&2
echo 'Count should be 4339' >&2
esac
echo Extracting 'nicetr'
sed 's/^X//' > 'nicetr' << '+ END-OF-FILE ''nicetr'
Xawk 'BEGIN {
X print ".tr '"'"'\\'"'"'`\\`"
X print ".vs 12p"
X print ".ps 10"
X FS=" "
X}
X/^.TS/ {
X print ".TS H"
X print "lw(1c) lw(1c) lw(5c) lw(1c) lw(1c) lw(5c)."
X print "\\f3Char\\fP\t\\f3Name\\fP\t\t\\f3Char\\fP\t\\f3Name\\fP"
X print ""
X print ".TH"
X tabling=1
X no_entries=0
X}
X/^.TE/ {
X tabling=0
X if ( no_entries%2==1 ) printf "\n"
X}
Xtabling==1 && $1!=".TS" {
X no_entries +=1
X if ( $2=="" ) {
X printf "\\(%s\t\\e(%s\t%s", $1, $1, $3
X } else {
X printf "%s\t\\e(%s\t%s", $2, $1, $3
X }
X if ( no_entries%2==0 ) printf "\n"
X else printf "\t"
X}
Xtabling==0
XEND {
Xprint ".tr '"''"'``"
Xprint ".ps 10"
X}
X' $* | tbl
+ END-OF-FILE nicetr
chmod 'u=r,g=r,o=r' 'nicetr'
set `wc -c 'nicetr'`
count=$1
case $count in
613) :;;
*) echo 'Bad character count in ''nicetr' >&2
echo 'Count should be 613' >&2
esac
echo Extracting 'tmac.la'
sed 's/^X//' > 'tmac.la' << '+ END-OF-FILE ''tmac.la'
X.de la
X.\" languages - keld@dkuug.dk & storm@dkuug.dk
X.\" Covers all ECMA registrered versions of ISO 646
X.\" Countries according to ISO 3166
X.\" Commented is registered ECMA char code and standard number
X.fl
X.ie \\n(.$=0 .ds )L \\*(=L
X.el .ds )L \\$1
X.ds =L \\*(LA
X.ds LA \\*()L
X.rm )L
X.if "\\*(LA"DK" .tr #\(sh$\(Do@\(at[\(AE\\\\\(/O]\(oA^\(ha\`\(ga{\(ae|\(/o}\(oa~\(ti \" DS 2089
X.if "\\*(LA"US" .tr #\(sh$\(Do@\(at[\(lB\\\\\(rs]\(rB^\(ha\`\(ga{\(lC|\(ba}\(rC~\(ti \" B X3.4-1968
X.if "\\*(LA"ISO" .tr #\(sh$\(Cs@\(at[\(lB\\\\\(rs]\(rB^\(ha\`\(ga{\(lC|\(ba}\(rC~\(rn \" @ IRV
X.if "\\*(LA"GB" .tr #\(Po$\(Do@\(at[\(lB\\\\\(rs]\(rB^\(ha\`\(ga{\(lC|\(ba}\(rC~\(rn \" A BS 4730
X.if "\\*(LA"DE" .tr #\(sh$\(Do@\(sc[\(:A\\\\\(:O]\(:U^\(ha\`\(ga{\(:a|\(:o}\(:u~\(ss \" K DIN 66 003
X.if "\\*(LA"FR" .tr #\(Po$\(Do@\(`a[\(de\\\\\(,c]\(sc^\(ha\`\(mu{\('e|\(`u}\(`e~\(ad \" f NF Z 62-010 (1982)
X.if "\\*(LA"CN" .tr #\(sh$\(Ye@\(at[\(lB\\\\\(rs]\(rB^\(ha\`\(ga{\(lC|\(ba}\(rC~\(rn \" T GB 1988-80
X.if "\\*(LA"JP" .tr #\(sh$\(Do@\(at[\(lB\\\\\(Ye]\(rB^\(ha\`\(ga{\(lC|\(ba}\(rC~\(rn \" n JIS C 6229-1984
X.if "\\*(LA"IT" .tr #\(Po$\(Do@\(sc[\(de\\\\\(,c]\('e^\(ha\`\(`u{\(`a|\(`o}\(`e~\(`i \" Y
X.if "\\*(LA"ES" .tr #\(Po$\(Do@\(sc[\(r!\\\\\(~N]\(r?^\(ha\`\(ga{\(de|\(~n}\(,c~\(ti \" Z
X.if "\\*(LA"ES2" .tr #\(sh$\(Do@\(bu[\(r!\\\\\(~N]\(,C^\(r?\`\(ga{\(aa|\(~n}\(,c~\(ad \" h
X.if "\\*(LA"PT" .tr #\(sh$\(Do@\(sc[\(~A\\\\\(,C]\(~O^\(ha\`\(ga{\(~a|\(,c}\(~o~\(de \" L
X.if "\\*(LA"PT2" .tr #\(sh$\(Do@\(aa[\(~A\\\\\(,C]\(~O^\(ha\`\(ga{\(~a|\(,c}\(~o~\(ti \" g
X.if "\\*(LA"HU" .tr #\(sh$\(Cs@\('A[\('E\\\\\(:O]\(:U^\(ha\`\('a{\('e|\(:o}\(:u~\(a" \" i MSZ 7795/3
X.if "\\*(LA"NO" .tr #\(sh$\(Do@\(at[\(AE\\\\\(/O]\(oA^\(ha\`\(ga{\(ae|\(/o}\(oa~\(rn \" ` NS 4551 - 1
X.if "\\*(LA"NO2" .tr #\(sc$\(Do@\(at[\(AE\\\\\(/O]\(oA^\(ha\`\(ga{\(ae|\(/o}\(oa~\(bv \" a NS 4551 - 2
X.if "\\*(LA"SE" .tr #\(sh$\(Cs@\(at[\(:A\\\\\(:O]\(oA^\(ha\`\(ga{\(:a|\(:o}\(oa~\(rn \" G SEN 850200 B
X.if "\\*(LA"SE2" .tr #\(sh$\(Cs@\('E[\(:A\\\\\(:O]\(oA^\(:U\`\('e{\(:a|\(:o}\(oa~\(:u \" H SEN 850200 C
X.if "\\*(LA"FI" .tr #\(sh$\(Do@\(at[\(:A\\\\\(:O]\(oA^\(ha\`\(ga{\(:a|\(:o}\(oa~\(ti
X.if "\\*(LA"JS" .tr #\(sh$\(Do@\(vZ[\(vS\\\\\(-D]\('C^\(vC\`\(vz{\(vs|\(-d}\(vc~\('c \" z JUS I.B1.002
X..
+ END-OF-FILE tmac.la
chmod 'u=r,g=r,o=r' 'tmac.la'
set `wc -c 'tmac.la'`
count=$1
case $count in
2287) :;;
*) echo 'Bad character count in ''tmac.la' >&2
echo 'Count should be 2287' >&2
esac
echo Extracting 'pchdefs'
sed 's/^X//' > 'pchdefs' << '+ END-OF-FILE ''pchdefs'
X.\" The following characters are not defined on the Agfa P400 printer
X.ds 'E E\h'-\w"E"u-\w"\(aa"u/2u'\v'-.19m'\(aa\v'.19m'
X.ds 'A A\h'-\w"A"u-\w"\(aa"u/2u'\v'-.19m'\(aa\v'.19m'
X.ds ~A A\h'-\w"A"u-\w"\(ti"u/2u'\v'-.19m'\(ti\v'.19m'
X.ds ~O O\h'-\w"O"u-\w"\(ti"u/2u'\v'-.19m'\(ti\v'.19m'
X.ds 'C C\h'-\w"C"u-\w"\(aa"u/2u'\v'-.19m'\(aa\v'.19m'
X.ds 'c c\h'-\w"c"u-\w"\(aa"u/2u'\v'-.19m'\(aa\v'.19m'
X.\" .ds vS S\h'-\w"S"u-\w"\(ah"u/2u'\v'-.19m'\(ah\v'.19m'
X.\" .ds vs s\h'-\w"s"u-\w"\(ah"u/2u'\v'-.19m'\(ah\v'.19m'
X.\" .ds vc c\h'-\w"c"u-\w"\(ah"u/2u'\v'-.19m'\(ah\v'.19m'
X.\" .ds vC C\h'-\w"C"u-\w"\(ah"u/2u'\v'-.19m'\(ah\v'.19m'
X.\" .ds vZ Z\h'-\w"Z"u-\w"\(ah"u/2u'\v'-.19m'\(ah\v'.19m'
X.\" .ds vz z\h'-\w"z"u-\w"\(ah"u/2u'\v'-.19m'\(ah\v'.19m'
X.\" .ds -D D\h'-\w"D"u-\w"\(hy"u/2u'\v'-.19m'\(hy\v'.19m'
X.\" .ds -d d\h'-\w"d"u-\w"\(hy"u/2u'\v'-.19m'\(hy\v'.19m'
+ END-OF-FILE pchdefs
chmod 'u=r,g=r,o=r' 'pchdefs'
set `wc -c 'pchdefs'`
count=$1
case $count in
858) :;;
*) echo 'Bad character count in ''pchdefs' >&2
echo 'Count should be 858' >&2
esac
echo Extracting 'typesetting'
sed 's/^X//' > 'typesetting' << '+ END-OF-FILE ''typesetting'
X%T Adventures with Typesetter\-Independent TROFF
X%A Mark Kahrs
X%A Lee Moore
X%R TR 159
X%C Rochester, NY
X%D June, 1985
X%I University of Rochester
X
X%A J. F. Ossanna
X%T NROFF/TROFF User's Manual
X%D October 1976
X%R Comp. Sci. Tech. Rep. 54
X%I Bell Laboratories
X%C Murray Hill, NJ
X
X%A B.W. Kernighan
X%T A typesetter-independent TROFF
X%D March 1982
X%R Comp. Sci. Tech. Rep. 97
X%I Bell Laboratories
X%C Murray Hill, NJ
X
X%A J. Akkerhuis
X%T Unknown
X%J Proceedings of the European Unix\(tm System User Group Autumn Meeting
X%D 7-9 september 1981
+ END-OF-FILE typesetting
chmod 'u=rw,g=r,o=r' 'typesetting'
set `wc -c 'typesetting'`
count=$1
case $count in
533) :;;
*) echo 'Bad character count in ''typesetting' >&2
echo 'Count should be 533' >&2
esac
exit 0
hansen@pegasus.att.com (Tony L. Hansen) (12/27/90)
< From: heimir@rhi.hi.is (Heimir Thor Sverrisson)

<< There are some methods of data interchange, such as most email systems,
<< that are inherently 7-bit. It would be nice if we could just banish
<< them, but compatibility is an albatross.

< There is nothing that tells me that email systems should be _inherently
< 7-bit_. In fact here in Iceland we have to hack almost every piece of
< communications software to be able to use it in our _inherently 8-bit
< environment_. This
<
< I can see no reason at all for some stupid mailers to strip off the
< eighth bit (including Interactive's version of sendmail). They don't
< have to - and should not - interpret the contents of the mail they are
< transmitting. This is quite different from the troff situation where a
< program has to know a lot about its input set. So why don't you guys
< simply open up your mailers and be ready with an 8-bit clean version by
< the end of 1991!

1991? Why not now? The System V release 4.0 mail program is completely 8-bit
clean! (If you can find anyplace within SVr4 mail which isn't, I'll
personally guarantee that the next version of mail which comes from UNIX
System Laboratories will have a fix for the problem.)

By the way, the SMTP protocol doesn't permit 8-bit data. This limits mailers
which must send mail using that protocol.

                    Tony Hansen
                att!pegasus!hansen, attmail!tony
                    hansen@pegasus.att.com
keld@login.dkuug.dk (Keld J|rn Simonsen) (12/28/90)
hansen@pegasus.att.com (Tony L. Hansen) writes:

>< From: heimir@rhi.hi.is (Heimir Thor Sverrisson)

>< I can see no reason at all for some stupid mailers to strip off the
>< eighth bit (including Interactive's version of sendmail). They don't
>< have to - and should not - interpret the contents of the mail they are
>< transmitting. This is quite different from the troff situation where a
>< program has to know a lot about its input set. So why don't you guys
>< simply open up your mailers and be ready with an 8-bit clean version by
>< the end of 1991!

>1991? Why not now? The System V release 4.0 mail program is completely 8-bit
>clean! (If you can find anyplace within SVr4 mail which isn't, I'll
>personally guarantee that the next version of mail which comes from UNIX
>System Laboratories will have a fix for the problem.)

I have also done patches to sendmail 5.61 and 5.64 to do 8-bit mail.
The patches require that you also have IDA installed.
The package handles quite a few different 8-bit character sets,
such as 8859-1, 8859-2 and the rest of the 8859 family, and vendor
character sets like the IBM codepages. In all, about 60 character sets
are supported in the current release.

>By the way, the SMTP protocol doesn't permit 8-bit data. This limits mailers
>which must send mail using that protocol.

My package also has provisions for sending 8-bit mail through 7-bit SMTP
in an "encoded ASCII" mode.

My package is available in dkuug.dk:pub/sm.8+bit.pa sm5.64.8+bit.pa
and ch.shar by anon ftp. By mail you can get it by mailing

    mail uunet!dkuug.dk!archive
    Subject: files pub
    Names: sm.8+bit.pa sm5.64.8+bit.pa ch.shar

It's about 100 kb. Enjoy!

Keld Simonsen
tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill) (12/28/90)
hansen@pegasus.att.com (Tony L. Hansen) writes:
>
> By the way, the SMTP protocol doesn't permit 8-bit data. This limits
> mailers which must send mail using that protocol.

True. But there is no technical reason (other than short-sightedness)
why SMTP has to strip off the 8th (high) bit. There are in fact
working versions of sendmail that don't disturb the 8th bit.
bruce@balilly.UUCP (Bruce Lilly) (12/28/90)
In article <1990Dec27.043500.27639@cbnewsk.att.com> hansen@pegasus.att.com (Tony L. Hansen) writes:
>< From: heimir@rhi.hi.is (Heimir Thor Sverrisson)
>
><< There are some methods of data interchange, such as most email systems,
><< that are inherently 7-bit. It would be nice if we could just banish
><< them, but compatibility is an albatross.
>
>< So why don't you guys
>< simply open up your mailers and be ready with an 8-bit clean version by
>< the end of 1991!
>
>1991? Why not now? The System V release 4.0 mail program is completely 8-bit
>clean! (If you can find anyplace within SVr4 mail which isn't, I'll
>personally guarantee that the next version of mail which comes from UNIX
>System Laboratories will have a fix for the problem.)

Is AT&T SVR4 available for the AT&T 3B1 and/or 7300 UNIX(TM)PC?
   ^^^^                        ^^^^
No? You've got to be kidding.

Reality check: it is impossible for everybody everywhere to upgrade (if that's
the right term (it might be)) to version N of any software. Even if economic
limitations are ignored, other considerations, such as inertia, inadequate
disk space, compatibility requirements with other software and (ahem :-)
lack of vendor support prevent many people from upgrading.

Of course, if you can make a port of SVR4 available at a reasonable price
for the 3B1, I'll upgrade. (i.e. put up or shut up)

Don't take this too seriously -- I'm really a big fan of AT&T. I just get
somewhat aggravated when I'm told (by AT&T) that I can't purchase (at any
price) AT&T software to run on my AT&T hardware (like the time I asked if
WWB was available for the 3B1).

> Tony Hansen
> att!pegasus!hansen, attmail!tony
> hansen@pegasus.att.com

--
    Bruce Lilly        blilly!balilly!bruce@sonyd1.Broadcast.Sony.COM
hansen@pegasus.att.com (Tony L. Hansen) (12/31/90)
<< From: hansen@pegasus.att.com (Tony L. Hansen)
<< By the way, the SMTP protocol doesn't permit 8-bit data. This limits
<< mailers which must send mail using that protocol.

< From: tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill)
< True. But there is no technical reason (other than short-sightedness)
< why SMTP has to strip off the 8th (high) bit. There are in fact
< working versions of sendmail that don't disturb the 8th bit.

I agree completely, there is no reason to limit SMTP to 7-bits.
Unfortunately, the standard currently REQUIRES the stripping and doing
anything else is non-standard. I would definitely support changing the
standard to allow an arbitrary 8-bit byte stream. This would also require
eliminating the limitation of 1024-byte lines and anything else in the
standard which is not content transparent.

System V release 4 mail is completely content transparent. As long as the
transport media is capable of handling the mail, SVr4 mail will be able to
get it to you unchanged. Unfortunately, it can't do so over SMTP
connections.

Since this discussion is going somewhat away from the bounds of comp.text,
I've added comp.mail.misc to the Newsgroup list.

                    Tony Hansen
                att!pegasus!hansen, attmail!tony
                    hansen@pegasus.att.com
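The two constraints named in this post, 7-bit data and a cap on line length, can be tested for before a message is handed to a strict SMTP transport. The following is only a sketch: `smtp_problems` is a hypothetical helper, and the 1024-byte default is simply the figure quoted in the post, not a value checked against the standard.

```python
def smtp_problems(body: bytes, max_line: int = 1024) -> list[str]:
    """Return reasons why `body` would not survive a strict 7-bit transport.

    max_line defaults to the figure quoted in the post (an assumption here).
    """
    problems = []
    for number, line in enumerate(body.split(b"\n"), start=1):
        line = line.rstrip(b"\r")  # ignore the CR of a CRLF line ending
        if any(byte > 0x7F for byte in line):
            problems.append(f"line {number}: byte with the 8th bit set")
        if len(line) > max_line:
            problems.append(f"line {number}: longer than {max_line} bytes")
    return problems
```

An empty list means the body passes both checks; a mailer could use a non-empty list as the trigger for encoding the body or refusing the transfer.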
barmar@think.com (Barry Margolin) (12/31/90)
In article <1990Dec31.004055.10335@cbnewsk.att.com> hansen@pegasus.att.com (Tony L. Hansen) writes:
>System V release 4 mail is completely content transparent. As long as the
>transport media is capable of handling the mail, SVr4 mail will be able to
>get it to you unchanged.

What does it do when sending textual mail to a system that doesn't use
ASCII encoding, e.g. an IBM mainframe, or to a system with a different
newline convention (e.g. CRLF rather than LF)? SMTP places restrictions on
the characters that may appear in a message to support automated
translation during the transfer process.
--
Barry Margolin, Thinking Machines Corp.

barmar@think.com
{uunet,harvard}!think!barmar
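The newline point can be made concrete. A transport that knows a body is text can rewrite line endings in transit, but the same rewrite applied to arbitrary bytes changes the data. A minimal sketch with hypothetical helper names:

```python
def to_network_newlines(text: bytes) -> bytes:
    """Convert bare LF line endings to CRLF, leaving existing CRLF alone."""
    return text.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def to_unix_newlines(text: bytes) -> bytes:
    """Convert CRLF line endings back to bare LF."""
    return text.replace(b"\r\n", b"\n")


if __name__ == "__main__":
    message = b"first line\nsecond line\n"
    wire = to_network_newlines(message)
    print(wire)
    print(to_unix_newlines(wire) == message)
```

Text that uses LF throughout survives the round trip, while a binary payload that happens to contain a CR LF pair does not, which is one reason a "content transparent" mailer and a translating transport pull in different directions.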
keld@login.dkuug.dk (Keld J|rn Simonsen) (01/02/91)
hansen@pegasus.att.com (Tony L. Hansen) writes:

>< From: tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill)
>< True. But there is no technical reason (other than short-sightedness)
>< why SMTP has to strip off the 8th (high) bit. There are in fact
>< working versions of sendmail that don't disturb the 8th bit.

This introduces a problem with "embedded slashes" which are now
represented internally in Sendmail with the 8th bit set. Has anybody
got Sendmail patches to remedy this?

>I agree completely, there is no reason to limit SMTP to 7-bits.
>Unfortunately, the standard currently REQUIRES the stripping and doing
>anything else is non-standard. I would definitely support changing the
>standard to allow an arbitrary 8-bit byte stream. This would also require
>eliminating the limitation of 1024-byte lines and anything else in the
>standard which is not content transparent.

I am much in favour of extending the character set supported by SMTP.
But you should be careful. What is the meaning of an 8-bit character?
Well, that depends on the character set employed. Today we know that only
7-bit ASCII is allowed. But with 8-bit mail, is this octal code 0162
coming over the line a "small a with acute accent" (as in ISO 8859-1:1987),
a Cent sign (as in IBM CP 437) or a "capital A with circumflex"
(as in HP Roman8)? This might become a real problem given the
current shares on the UNIX market.

Just displaying the 8-bit data to a user may be very confusing.
It may even do strange things to your terminal equipment if an
IBM Codepage character set is employed, as some of the characters
here are in the upper control character sets of ISO 8859 and other
vendors' character sets.

Should one then just say "Use ISO 8859"? Well, what ISO 8859?
There are several parts, latin 1, latin 2 (eastern Europe),
Greek, Cyrillic, Arabic, Hebrew (among others). The abovementioned
character 0162 has different meanings in these different character sets.
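The ambiguity described here is easy to reproduce. The sketch below decodes one and the same byte under three of the character sets named in the post, using codecs shipped with Python. It assumes the byte in question is decimal 162 (hex A2); that reading of the post's "0162" is an assumption, and the glyph each codec reports need not match the attribution given in the post.

```python
import unicodedata

# Assumption: the byte under discussion is decimal 162 (hex A2).
BYTE = bytes([162])

# Codec names are those of Python's standard library.
CHARSETS = {
    "ISO 8859-1": "latin_1",
    "IBM CP 437": "cp437",
    "HP Roman8": "hp_roman8",
}


def interpretations(raw: bytes) -> dict[str, str]:
    """Map each character set label to the Unicode name of the decoded byte."""
    return {
        label: unicodedata.name(raw.decode(codec))
        for label, codec in CHARSETS.items()
    }


if __name__ == "__main__":
    for label, name in interpretations(BYTE).items():
        print(f"{label:12} {name}")
```

Whatever the individual names turn out to be, the three results differ, and nothing in the byte itself says which interpretation the sender intended.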

ISO 8859-1 would be the natural choice (and is also specified in a
recent RFC on encoding: header.) But is that fair? I think that is like
inventing a new ASCII, only capable of serving one region of the
world sufficiently - this time having Western Europe (EEC) and
all of North and South America covered. We should do something
that could cover the whole world.

It is also quite hard to persuade your manufacturers to change their
implementation character set, and even worse for equipment you
have already bought and installed. Some of this may even be
running software with no 8-bit capabilities!

I think it would be nice to be able to support all of these new and
old systems, and I have done an implementation of Sendmail
capable of supporting more than 60 character sets. It currently
does not touch the headers, but only the mail body. For characters
not in the current character set, it encodes the character
with a mnemonic code, for example a' for the above mentioned
"small a with acute". Thus even in ASCII you can get the message!

The sendmail patches are available by anon ftp in
dkuug.dk:pub/ch.shar and sm5.64.8+bit.pa (sm.8+bit.pa for 5.61).
It's about 100 kb - the Sendmail patches themselves are under 100 lines,
the rest is the character set stuff. It has been running here at dkuug.dk
since Feb 90.

A new ISO standard is showing up: ISO 10646 (which has just been
published as a Draft International Standard (DIS)). This covers
all characters in the world, with very few exceptions. And the
exceptions are planned to be included in a later issue.
Actually Dan Oscarsson and I have been planning (mostly Dan)
to do an SMTP implementation for Sendmail negotiating 10646
for transmission, and to write an RFC for this character set negotiation.

Keld Simonsen
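The mnemonic fallback described in this post can be sketched in a few lines. This is not the code in the sendmail patches; the function name and the table are hypothetical. Only the a' entry comes from the post, and the other two table entries are assumptions added for illustration.

```python
# Illustrative table only: the post gives a' for "small a with acute";
# the other entries are assumptions made for this sketch.
MNEMONICS = {
    "\u00e1": "a'",  # small a with acute (the post's example)
    "\u00e5": "aa",  # small a with ring above (assumed mnemonic)
    "\u00f8": "o/",  # small o with stroke (assumed mnemonic)
}


def to_charset(text: str, charset: str = "ascii", unknown: str = "?") -> bytes:
    """Encode text, replacing characters missing from `charset` by mnemonics."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(charset)
        except UnicodeEncodeError:
            # Character is not representable: fall back to its mnemonic.
            out += MNEMONICS.get(ch, unknown).encode(charset, errors="replace")
    return bytes(out)


if __name__ == "__main__":
    print(to_charset("B\u00e5stad"))             # target charset: ASCII
    print(to_charset("B\u00e5stad", "latin_1"))  # target charset has the letter
```

The design point is that the substitution happens per character and per target character set, so a recipient whose set contains the letter gets it unchanged and only the others see the mnemonic.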
jacob@gore.com (Jacob Gore) (01/02/91)
/ comp.text / keld@login.dkuug.dk (Keld J|rn Simonsen) / Jan 1, 1991 /
> Should one then just say "Use ISO 8859"? Well, what ISO 8859?
> There are several parts, latin 1, latin 2 (eastern Europe),
> Greek, Cyrillic, Arabic, Hebrew (among others). The abovementioned
> character 0162 has different meanings in these different character sets.

Specify the character set in the header. For example,

    Char-Encoding: ISO-8859-Latin-1

Jacob
--
Jacob Gore        Jacob@Gore.Com        boulder!gore!jacob
les@chinet.chi.il.us (Leslie Mikesell) (01/02/91)
In article <1990Dec31.013538.9473@Think.COM> barmar@think.com (Barry Margolin) writes:
>In article <1990Dec31.004055.10335@cbnewsk.att.com> hansen@pegasus.att.com (Tony L. Hansen) writes:
>>System V release 4 mail is completely content transparent. As long as the
>>transport media is capable of handling the mail, SVr4 mail will be able to
>>get it to you unchanged.

>What does it do when sending textual mail to a system that doesn't use
>ASCII encoding, e.g. an IBM mainframe, or to a system with a different
>newline convention (e.g. CRLF rather than LF)? SMTP places restrictions on
>the characters that may appear in a message to support automated
>translation during the transfer process.

But the automated translation can currently only work with text while
many mailers are now capable of attaching arbitrary binary data to
messages. Depending on the type of the content, a different transformation
(or none) may be desired. Assuming that the non-textual portions are
encapsulated with "Content-Type:" and "Content-Length:" headers, it
would be easy for the transport to determine what, if any, transformation
to use. In addition, an optional "Encoding-Method:" header can allow
temporary transformations to meet the character set requirements of
the transports. If the sending program had a way to determine the
capabilities of the recipient, encoding could be done on-the-fly, using
uuencode or atob, and thus only done where necessary (but I don't know
of anyone actually doing this yet...).

These issues are going to have to be addressed for messages originating
on X.400 systems anyway, so why not try to do it efficiently by adding
the equivalent functionality to SMTP/uucp mailers?

Les Mikesell
  les@chinet.chi.il.us
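The idea of encoding only where the next hop requires it can be sketched as follows. The header name "Encoding-Method:" comes from the post; the function names, the `eight_bit_clean` flag and the header value `uuencode` are hypothetical, and real uuencode output would also carry `begin`/`end` framing lines that are omitted here.

```python
import binascii


def prepare_body(body: bytes, eight_bit_clean: bool) -> tuple[dict[str, str], bytes]:
    """Return (extra headers, body), encoding only where necessary."""
    needs_encoding = any(byte > 0x7F for byte in body)
    if eight_bit_clean or not needs_encoding:
        return {}, body
    # uuencode works on chunks of at most 45 bytes per output line.
    lines = [binascii.b2a_uu(body[i:i + 45]) for i in range(0, len(body), 45)]
    return {"Encoding-Method": "uuencode"}, b"".join(lines)


def restore_body(headers: dict[str, str], body: bytes) -> bytes:
    """Undo prepare_body at the receiving end."""
    if headers.get("Encoding-Method") != "uuencode":
        return body
    return b"".join(binascii.a2b_uu(line) for line in body.splitlines())
```

A sender would call `prepare_body` once it knows (or has guessed) whether the transport is 8-bit clean, and the receiver would call `restore_body` with the headers it was given; bodies that are already 7-bit pass through untouched in both directions.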
tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill) (01/03/91)
keld@login.dkuug.dk (Keld J|rn Simonsen) writes:
>
> Should one then just say "Use ISO 8859"? Well, what ISO 8859?
> There are several parts, latin 1, latin 2 (eastern Europe),
> Greek, Cyrillic, Arabic, Hebrew (among others)...
> We should do something that could cover the whole world.

This is what Unicode is for. Unicode should be considered the most
useful and implementable subset of the draft standard ISO 10646.

Unicode is an unambiguous fixed-length 16-bit global codeset currently
under development by the Unicode Consortium. Unicode offers a uniform
text and character standard that can encompass all living languages and
form a long-lasting basis for worldwide data exchange.

Unicode makes all 65,536 slots available, with these constraints:

  o The first 256 slots duplicate the arrangement of ASCII and ISO Latin-1.
  o Characters unique to a language are grouped together in standard order.
  o Letters, punctuation, symbols, and diacritics shared by multiple
    languages are grouped together.
  o Asian pictographs are grouped together in order of frequency (as
    specified by national standards), then sorted in traditional
    radical/stroke order.
  o Chinese, Japanese, and Korean phonetic symbols are grouped together
    by language in standard order.

The reason 16 bits are enough is that Asian pictographs which everyone
would recognize as the same have been unified. Thus, more than 31,000
characters have been reduced to about 20,000 slots.

        Major Han Character Standards

        Country   Standard    Year   Characters
        -------   --------    ----   ----------
        China     GB 2312     1980        6,763
        Japan     JIS X0208   1983        6,349
        Korea     KS C5601    1987        4,888
        Taiwan    CNS 11643   1986       13,051
                                         ------
                                  total  31,051

In addition to East Asian languages, here are the writing systems
currently available in Unicode: Greek, Cyrillic, Georgian, Armenian,
Hebrew, Arabic, Ethiopian, Devanagari, Bengali, Gurmukhi, Gujarati,
Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhalese, Thai, Lao,
Burmese, Khmer, Tibetan, and Mongolian.
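Three of the claims in this post can be checked mechanically: the size of a 16-bit code space, the statement that the first 256 slots follow ASCII and ISO Latin-1, and the total in the table of Han standards. The sketch below uses only Python's built-in codecs; the variable names are illustrative.

```python
# Number of distinct values a fixed-length 16-bit code can represent.
SLOTS = 2 ** 16

# Decoding every byte value as ISO Latin-1 yields code points 0..255 in order,
# and the first 128 of those coincide with ASCII.
latin1 = bytes(range(256)).decode("latin_1")
first_256_match = [ord(ch) for ch in latin1] == list(range(256))
ascii_match = bytes(range(128)).decode("ascii") == latin1[:128]

# Character counts from the table of national Han standards.
HAN_STANDARDS = {
    "GB 2312": 6763,
    "JIS X0208": 6349,
    "KS C5601": 4888,
    "CNS 11643": 13051,
}
han_total = sum(HAN_STANDARDS.values())

if __name__ == "__main__":
    print(SLOTS, first_256_match, ascii_match, han_total)
```

The Latin-1 check is the practical consequence of the first constraint: text in that one 8-bit set can be widened to 16-bit codes by zero-extending each byte, with no lookup table.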
keld@login.dkuug.dk (Keld J|rn Simonsen) (01/03/91)
tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill) writes:

>keld@login.dkuug.dk (Keld J|rn Simonsen) writes:
>>
>> Should one then just say "Use ISO 8859"? Well, what ISO 8859?
>> There are several parts, latin 1, latin 2 (eastern Europe),
>> Greek, Cyrillic, Arabic, Hebrew (among others)...
>> We should do something that could cover the whole world.

>This is what Unicode is for. Unicode should be considered the most
>useful and implementable subset of the draft standard ISO 10646.

Is UNICODE a true subset of ISO 10646?
Is there a well defined relation between ISO 10646 encoding and UNICODE?

Season's greetings!

Keld
tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill) (01/04/91)
keld@login.dkuug.dk (Keld J|rn Simonsen) writes:
>
> Is UNICODE a true subset of ISO 10646?
> Is there a well defined relation between ISO 10646 encoding and UNICODE?

ISO 10646 is still in draft form. Both questions are impossible to
answer until 10646 gets finalized.

Disclaimer: I'm not an expert in this area. However, extrapolating
from what I know, it appears that Unicode could be considered a 16-bit
implementation of 10646. The ISO 10646 draft standard appears to permit
16-bit implementations of any subset thereof, for use in process code
or communication.

It just so happens that Unicode covers all Asian characters enumerated
by existing national standards, plus characters from languages that the
10646 draft hasn't even thought about. So it may be a subset, but a
largely complete subset.

Lee Collins writes:
> Notice that 10646 would require 93,816 separate codes to cover existing
> [Chinese/Japanese/Korean] standards. Han Unification allows Unicode to
> cover the same standards with only 18,739 unique characters.

Ken Whistler writes:
> Unicode 1.0 also includes the following scripts omitted from DIS 10646:
> Ethiopian, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada,
> Malayalam, Sinhalese, and Lao.

There have been attempts to convert Unicode to 10646 and back again,
I believe with mostly good results. Of course, some data may be lost
in the translation.