npn@cbnewsl.att.com (nils-peter.nelson) (12/18/90)
We have gotten several requests for support of international character sets, and for "8 bit clean" troff. We need some help figuring out what is needed. Ignoring for the moment Kanji as a separate case, this inquiry is directed to the European market.

If we select a specific character, say the Swedish character that looks like an a with a circle over it, as in the second character of my ancestral home town, Bastad, then the following implementations are possible:

1. Using the current DWB troff longname convention, you can address any PostScript character, viz. "B\N'aring'stad" (aring is the Adobe PostScript name for this character, which, for the non-Swedes, is pronounced "oh").

2. We could invent a new troff escape, viz. "B\(aostad". We don't want to invent one if there is an established convention.

3. There are apparently 8 bit internal representations that are created by double striking keys (perhaps ESCAPE A, but I don't know). I need to know if there is a practical standard for such (i.e., one people actually use).

4. I recall some countries (e.g., Scandinavian) use ordinary 7 bit punctuation chars in lieu of these letters, viz. "B{stad" or somesuch.

I don't want to hear about every pending international standard. I want to hear what people who will be using the software would like to see. If you have answers to the above, please send them to npn@mhuxo.att.com. I am looking for data, not opinions, so replies from Europe will be heavily weighted over others. I am not trying to create another "gaol" vs. "jail" controversy.
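[Editorial sketch, not part of the original post. To make the four options concrete, here is one word written all four ways. Several details are assumptions: option 3 is shown as ISO 8859-1, which the post does not name; the "ao" escape name is only the one floated in the post; and "}" is used as the 7-bit stand-in because the tmac.la file later in this thread maps "}" to a-ring for Sweden, whereas the post's "B{stad" was explicitly approximate. The function names are made up for illustration.]

```python
# Illustrative only: one word in each of the four candidate spellings.
NAME = "B\u00e5stad"  # U+00E5 is a-ring


def longname(text):
    """Option 1: DWB-style \\N'name' escape using the Adobe glyph name."""
    return text.replace("\u00e5", "\\N'aring'")


def two_char(text, name="ao"):
    """Option 2: a two-character \\(xx escape ('ao' is the name floated in the post)."""
    return text.replace("\u00e5", "\\(" + name)


def eight_bit(text):
    """Option 3: one 8-bit byte per character (assumed here to be ISO 8859-1)."""
    return text.encode("latin-1")


def national_646(text, stand_in="}"):
    """Option 4: a 7-bit punctuation code standing in for the letter."""
    return text.replace("\u00e5", stand_in).encode("ascii")


if __name__ == "__main__":
    print(longname(NAME))
    print(two_char(NAME))
    print(eight_bit(NAME))
    print(national_646(NAME))
```

Options 1, 2 and 4 survive a 7-bit path unchanged; only option 3 needs every program along the way to leave the eighth bit alone.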
yfcw14@castle.ed.ac.uk (K P Donnelly) (12/19/90)
I have come in on this discussion from the outside, but it sounds as if you have misunderstood the requests. It sounds to me as if what people are asking for is for troff to stop stripping the eighth bit off characters in the input file, but instead to pass them to the output file just like (7-bit) ASCII characters.

There is no need to invent new (7-bit) ASCII representations of non-ASCII characters, such as \ao for a-ring. Such representations may be desirable for other purposes, but that is a separate issue.

Anyone with a VT220 or VT320 terminal can input a-ring using the three character sequence <Compose-character> a *; the character gets hexadecimal code E5 in compliance with the ISO standard for western European languages. You see it on the screen and edit it just like any other character using any editor (such as version 3.10 of microEmacs) which doesn't strip the eighth bit.

However, it gets very frustrating if the software which gets between you and the laser printer insists for no good reason on stripping the eighth bit and turning the a-ring into an 'e' (ASCII code 65 hex). There is lots of such software around, especially on Unix. I think it is something to do with the eighth bit having been used for parity check in the past, so the software thought it was safest to filter it out. Nowadays, with cleaner communications lines, the eighth bit is hardly ever used for parity check - it wasn't a very good system anyway - and software packages which in the past have stripped the eighth bit are one by one changing their policies - witness Kermit 3.0, TeX 3.0, microEmacs 3.10.

The Scandinavians have up til now used "national versions" of ASCII in which characters like { } ~ | are replaced by national characters like a-ring. This often makes their names look weird in mail signatures, and must cause them a lot of trouble when programming in languages like C and Pascal.

In computing, the Germans use the alternative system of placing an 'e' after the vowel instead of an umlaut sign above it. The French have such a variety of accented characters that in computing (mail messages and so on) they usually give up and leave out the accents. Sometimes they use devices like putting an apostrophe before or after the vowel to indicate an acute accent. On the Gaelic language bulletin board in which I participate, we always use a slash, '/', after the vowel to indicate an acute accent.

The Icelanders have many more non-ASCII characters in their language than other Scandinavians, including some unique ones such as 'eth' and 'thorn' where you can't just "leave out the accent", so they have long been ahead of the world in using 8-bit text in their computing.

Kevin Donnelly
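[Editorial sketch, not part of the original post. The arithmetic behind the a-ring turning into an 'e' is a single mask: clearing the top bit of hex E5 leaves hex 65. The helper names below are made up for illustration, and the check for "8-bit clean" is only the crude test of whether a filter passes all 256 byte values unchanged.]

```python
def strip_eighth_bit(data: bytes) -> bytes:
    """What a parity-era filter does: clear bit 7 of every byte."""
    return bytes(b & 0x7F for b in data)


def is_8bit_clean(transform) -> bool:
    """True if the filter passes every possible byte value unchanged."""
    sample = bytes(range(256))
    return transform(sample) == sample


if __name__ == "__main__":
    text = "B\u00e5stad".encode("latin-1")   # a-ring is the single byte 0xE5
    print(text, "->", strip_eighth_bit(text))
    print("stripping filter clean?", is_8bit_clean(strip_eighth_bit))
    print("pass-through clean?", is_8bit_clean(lambda d: d))
```

The damage is silent: the stripped output is still valid ASCII, so nothing downstream can tell that a character was lost.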
rcd@ico.isc.com (Dick Dunn) (12/20/90)
yfcw14@castle.ed.ac.uk (K P Donnelly) writes:
> It sounds to me as if what people are asking for is for troff to stop
> stripping the eighth bit off characters in the input file, but instead
> to pass them to the output file just like (7-bit) ASCII characters.

It's not at all that simple. Troff has to know about the characters--it needs to be able to find them in its width tables and know whether the characters have ascenders and/or descenders (for sb/st/ct number regs).

There's also an issue of whether troff should produce 8-bit codes on its output--there are some good arguments that it should not. The matter of 7-bit data paths is rather more complicated (and clumsy) than the single issue of a parity bit that Donnelly mentions. There are some methods of data interchange, such as most email systems, that are inherently 7-bit. It would be nice if we could just banish them, but compatibility is an albatross.

The issue of inventing alternate representations, such as \(ao for "a ring" goes beyond the issue of simple 8-bit transparency. There are many more characters needed than can be represented in an 8-bit code set. Certainly one wants a conventional 8-bit set (such as Latin 1) for convenience, but more characters are needed even for European usage. It is useful to have a canonical representation in terms of 7-bit codes even if it's not the most commonly used.

> The Scandinavians have up til now used "national versions" of ASCII in
> which characters like { } ~ | are replaced by national characters like
> a-ring...

These are not ASCII. They are national versions of ISO 646. If you like, you could think of ASCII as a "national version" of ISO 646 used in the USA. 646 provides a few codes which are reserved for national characters; ASCII provides a particular assignment to those codes. The Scandinavian conventions are simply different assignments.

> ...The Germans use in computing the alternative system of
> placing an 'e' after the vowel instead of an umlaut sign above it...

This alternative representation far predates computer usage, although it is certainly a convenient solution. Note also that scharfes ess turns into "ss".

--
Dick Dunn     rcd@ico.isc.com -or- ico!rcd       Boulder, CO   (303)449-2870
   ...Mr. Natural says, "Use the right tool for the job."
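[Editorial sketch, not part of the original post. The point that national ISO 646 versions are "simply different assignments" to the same codes can be shown with a small lookup table. The assignments below are partial and are taken from the tmac.la file posted later in this thread; only a few reserved positions are listed, and the names VARIANTS and decode_646 are made up for illustration.]

```python
# Partial national assignments for the ISO 646 reserved positions.
# Any code not listed keeps its US (ASCII) meaning.
VARIANTS = {
    "US": {},
    "DK": {"[": "Æ", "\\": "Ø", "]": "Å", "{": "æ", "|": "ø", "}": "å"},
    "SE": {"[": "Ä", "\\": "Ö", "]": "Å", "{": "ä", "|": "ö", "}": "å"},
    "DE": {"@": "§", "[": "Ä", "\\": "Ö", "]": "Ü",
           "{": "ä", "|": "ö", "}": "ü", "~": "ß"},
}


def decode_646(data: bytes, variant: str) -> str:
    """Interpret 7-bit bytes under one national assignment of the reserved codes."""
    table = VARIANTS[variant]
    return "".join(table.get(ch, ch) for ch in data.decode("ascii"))


if __name__ == "__main__":
    line = b"S} skriver vi noget s|dt p} dansk"
    for variant in ("US", "DK", "DE"):
        print(variant, decode_646(line, variant))
```

The same bytes read correctly only under the variant they were typed in, which is why a document has to say which assignment it uses.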
clewis@ecicrl.UUCP (Chris Lewis) (12/21/90)
In article <1990Dec20.012516.23623@ico.isc.com> rcd@ico.isc.com (Dick Dunn) writes:
>yfcw14@castle.ed.ac.uk (K P Donnelly) writes:
>> It sounds to me as if what people are asking for is for troff to stop
>> stripping the eighth bit off characters in the input file, but instead
>> to pass them to the output file just like (7-bit) ASCII characters.

>It's not at all that simple. Troff has to know about the characters--it
>needs to be able to find them in its width tables and know whether the
>characters have ascenders and/or descenders (for sb/st/ct number regs).

Donnelly is right though, that's primarily what people *do* want.

(Note that 8-bit clean in ditroff doesn't imply that ditroff is passing the 8-bit characters directly to the printer - on the contrary, it only means that 8-bit characters can appear in the ditroff *input* file, and 8-bit characters can be found (somehow) in the width tables. There's no particular reason that the ditroff format output file (troff(5)) must actually contain 8-bit characters, for this file isn't really intended to be read, and the "c<char>" directive could be extended to permit octal or some other 7-bit clean representation if necessary.)

[Refresher on ditroff guts:

    troff document -> ditroff -> ditroff intermediate -> filter -> printer
                        ^                                  ^
                        |                                  |
                        +-------------+--------------------+
                                      |
                                 width tables

The width tables contain the width and kerning information that ditroff needs to know for character placement, and also contain the byte that the filter emits for each character (though the filter doesn't have to use them). The ditroff intermediate is a displayable file with simple commands indicating character placement, font size, points etc.]

It shouldn't be all that much harder for ditroff to permit 8-bit characters in the width tables. After all, it does permit octal sequences in the fourth field (the character the backend is to emit to generate the desired glyph).

It would be nicest if the left most column (the character ditroff is searching for) could be 8-bit, but octal would probably serve in a pinch. Permitting both would be even better (and would permit transmission of these files over 7-bit paths, editing via 7-bit vi's, etc.). Psroff's analogous tables do permit this.

'Twould be especially nice now that the newest vi's are 8-bit clean. And emacs is now as well.

>There's also an issue of whether troff should produce 8-bit codes on
>its output--there are some good arguments that it should not. The matter
>of 7-bit data paths is rather more complicated (and clumsy) than the single
>issue of a parity bit that Donnelly mentions. There are some methods of
>data interchange, such as most email systems, that are inherently 7-bit.
>It would be nice if we could just banish them, but compatibility is an
>albatross.

Since you're talking about troff generating 8-bit codes on its output, I'm not sure that this is a real issue, because the intermediate ditroff format output isn't really an interchange format. Regardless, what people want is the ability to jam 8-bit characters into the input of ditroff and have it do sane things, not necessarily their representation in the intermediate file.

As it is now, the width tables *do* permit the filter to emit 8-bit characters to the printer - they *have* to. On that note, you might have trouble getting 8-bit characters to the printer, but that's the OS's/system administrator's/printer designer's fault.

On the other hand, permitting ditroff to accept 8-bit characters on *input* may get people into trouble when they try to mail something through a 7-bit path. But it isn't all that difficult to solve, either by uuencoding (or similar) or having a program that converts the 8-bit characters to the \(xx convention (and vice-versa) (I'm in fact going to be implementing something like this in Psroff). People solve it all the time when shipping PC binaries.

Requiring all of Europe to type those silly 4-character sequences when trying to edit documents in their own language when 8-bit is *easy* isn't a very nice thing to do to faithful customers.

And it isn't *just* Europe. It's Canada too. (I'm a member of the CSA/Treasury Board Canadian Posix Working Group). Canada is also trying to encourage Latin-1 because of bilingual (English/French) requirements both in government and in the private sector. 7-bit ASCII is very nearly *only* the USA (most other English speaking countries are either tending towards Latin-1, or a different version of 646. 646 in all its national variants only satisfies completely a minority of the Roman-alphabet countries).

Lest one think that Canada is a minor addition to Western Europe in this context, one should remember that Canada is the US's biggest single trading partner by a substantial margin (considerably larger than Japan). The only market bigger than Canada is the EEC taken as a unit (aka Western Europe, aka the other Latin-1 countries). Markets, markets! Heck, if you're ever even thinking about Kanji, you really should satisfy the considerably larger group of customers that need Latin-1.

>The issue of inventing alternate representations, such as \(ao for "a ring"
>goes beyond the issue of simple 8-bit transparency.

It *isn't* transparency, on the contrary. HOWEVER, having both the 8-bit input transparency as well as alternate representations that need only 7-bit would be a definite plus, so that people with 646 compliant terminals can do the same things that the newer Latin-1 ones can.

And it ain't all that hard to do. Hell, if I can do it in psroff via CAT troff without source, AT&T should be able to do it with ditroff.

--
Chris Lewis, Phone: (613) 832-0541
UUCP: uunet!utai!lsuc!ecicrl!clewis
Moderator of the Ferret Mailing List (ferret-request@eci386)
Psroff mailing list (psroff-request@eci386)
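[Editorial sketch, not part of the original post and not Psroff's actual code. The "program that converts the 8-bit characters to the \(xx convention (and vice-versa)" could look roughly like this. It assumes the input is ISO 8859-1 and borrows two-character names from the Keizer/Simonsen/Akkerhuis proposal posted later in this thread. The table is deliberately partial, and the reverse scan is naive: it does not understand \e, \\ or other troff escapes that might precede a "(".]

```python
# Latin-1 byte value -> two-character troff name (partial, illustrative table).
NAMES = {
    0xC4: ":A", 0xC5: "oA", 0xC6: "AE", 0xD6: ":O", 0xD8: "/O", 0xDC: ":U",
    0xDF: "ss", 0xE0: "`a", 0xE4: ":a", 0xE5: "oa", 0xE6: "ae", 0xE7: ",c",
    0xE8: "`e", 0xE9: "'e", 0xF6: ":o", 0xF8: "/o", 0xFC: ":u",
}
BY_NAME = {name: code for code, name in NAMES.items()}


def to_escapes(data: bytes) -> bytes:
    """Rewrite 8-bit bytes as 7-bit \\(xx escapes; refuse bytes we cannot name."""
    out = bytearray()
    for b in data:
        if b < 0x80:
            out.append(b)
        elif b in NAMES:
            out += b"\\(" + NAMES[b].encode("ascii")
        else:
            raise ValueError("no escape name known for byte 0x%02X" % b)
    return bytes(out)


def from_escapes(data: bytes) -> bytes:
    """Rewrite known \\(xx escapes back to single Latin-1 bytes (naive scan)."""
    out = bytearray()
    i = 0
    while i < len(data):
        name = data[i + 2:i + 4].decode("ascii", "replace")
        if data[i:i + 2] == b"\\(" and name in BY_NAME:
            out.append(BY_NAME[name])
            i += 4
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


if __name__ == "__main__":
    original = "Fran\u00e7ois i B\u00e5stad".encode("latin-1")
    mailed = to_escapes(original)          # safe on a 7-bit path
    print(mailed)
    print(from_escapes(mailed) == original)
```

Escapes that are not in the table, such as \(em, pass through from_escapes untouched, so the converter can be run over ordinary troff input.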
heimir@rhi.hi.is (Heimir Thor Sverrisson) (12/24/90)
In <1990Dec20.012516.23623@ico.isc.com> rcd@ico.isc.com (Dick Dunn) writes:
> There are some methods of
>data interchange, such as most email systems, that are inherently 7-bit.
>It would be nice if we could just banish them, but compatibility is an
>albatross.

There is nothing that tells me that email systems should be _inherently 7-bit_. In fact here in Iceland we have to hack almost every piece of communications software to be able to use it in our _inherently 8-bit environment_.

Thus I can see no reason at all for some stupid mailers to strip off the eighth bit (including Interactive's version of sendmail). They don't have to - and should not - interpret the contents of the mail they are transmitting. This is quite different from the troff situation where a program has to know a lot about its input set.

So why don't you guys simply open up your mailers and be ready with an 8-bit clean version by the end of 1991!

--
Heimir Thor Sverrisson heimir@rhi.hi.is
keld@login.dkuug.dk (Keld J|rn Simonsen) (12/26/90)
I was the coauthor of an article on 7-bit names for troff of non-ASCII characters, together with some support for national ISO 646 variants. Here is the article. Enjoy! : This is a shar archive. Extract with sh, not csh. : This archive ends with exit, so do not worry about trailing junk. : --------------------------- cut here -------------------------- PATH=/bin:/usr/bin:/usr/ucb echo Extracting 'Makefile' sed 's/^X//' > 'Makefile' << '+ END-OF-FILE ''Makefile' Xextchar.dit: extchar specchar X refer -e -p typesetting extchar | tbl | troff $(DEVICE) -ms >extchar.dit X Xspecchar: spec nicetr X sh nicetr spec >specchar X Xprint: extchar.dit X dip $(DEVICE) extchar.dit X Xallchar: spec nicetr spec.p400 X sh nicetr spec spec.p400 >allchar X Xdistr: X shar Makefile extchar spec nicetr tmac.la pchdefs typesetting >distr + END-OF-FILE Makefile chmod 'u=r,g=r,o=r' 'Makefile' set `wc -c 'Makefile'` count=$1 case $count in 351) :;; *) echo 'Bad character count in ''Makefile' >&2 echo 'Count should be 351' >&2 esac echo Extracting 'extchar' sed 's/^X//' > 'extchar' << '+ END-OF-FILE ''extchar' X.ds LF DRAFT X.so pchdefs X.so tmac.la X.la US X.TL XAn extension to the troff character set X.AU XE.G. Keizer XK.J. Simonsen XJ. Akkerhuis X.AI XVrije Universiteit, Amsterdam, The Netherlands XUniversity of Copenhagen, Copenhagen, Denmark XC.M.U., USA X.AB XThe typesettting program X.I troff Xwas originally written for formatting English text for the CAT 48 Typesetter. XIts offspring is used for formatting a variety of languages with a large Xdiversity of output devices. XThe authors agreed on an addition Xto the troff character set covering old and new national and Xinternational latin based character sets. X.AE X.NH XThe problems X.PP XWhen adapting the X.UX Xtypesetting program X.I troff\| X.[ XOssanna X.] Xto a new output device, one wants to have access Xto the extra characters offered by the device, Xwithout sacrificing any characters already in use. 
XDevice independent \f2troff\fP, also called X.I titroff Xor X.I ditroff\| X.[ XTypesetter independent troff Kernighan X.] Xuses a flexible font definition mechanism that allows addition and deletion Xof characters. X.LP XMany people, including the authors, have used this mechanism X.[ XKahrs and Moore X.] Xto add characters. XThis has led to a diversity of names, with the expected conflicts Xof using the same name for different characters and different names for the same Xcharacter in different implementations. XThus X.I troff Xinput files are becoming less and less portable, Xeven for the same output device on different installations. XWe regret this development and, during a conference in Copenhagen, Xwe decided to make an attempt at some standardization. XB.W. Kernighan, author of ditroff, agreed to our proposal of acting as a Xclearing house for our new names and he still has to give his blessing Xto this article. X.PP XWe realize that it is impossible to name every printable character in the Xworld. XThe total amount of different characters is simply too huge. XNaming all the hundred different turtles in a turtle font is both Xfrustrating and futile. XWe restricted ourselves to the following categories: X.IP \(bu Xcharacters belonging to the printed language of several Western-European Xcountries: \(AE \(ss X.IP \(bu Xvariations of letters, especially with accents: \(:i \(^u \(:o X.IP \(bu Xoften used mathematical symbols: \(AN \(OR \(c* X.NH 2 XNatural language support X.PP XThe X.I troff Xcharacter set is based on the US-ASCII standard. XThis standard is well suited to English text, Xbut causes problems when used in most European countries. XUS-ASCII is a version of the ISO 646-1983 standard. XThe ISO standard states the characters used in 7 bit ASCII Xand contains 94 printable graphic symbols. XISO-646 allows national versions for 12 of its character positions. 
XThe European Computer Manufacturers' Association ECMA is registering Xall different national versions of ISO 646-1983 and assigns Xa different character to each. XThis allows the creation of documents with multiple character sets. XThe assigned character serves to identify each character set in such documents. XThe table below shows several versions conforming to ISO 646. X.TS Xcenter,allbox; Xc s s s s s s s s s s s s s s Xl l l s s s s s s s s s s s s Xl l l c c c c c c c c c c c c. XNational ISO 646 character sets XCountry Standard .la parameter XISO ISO 646 IRV ISO \(sh \(Cs \(at \(lB \(rs \(rB \(ha \(oq \(lC \(ba \(rC \(rn XUSA X3.4-1968 US \(sh \(Do \(at \(lB \(rs \(rB \(ha \(oq \(lC \(ba \(rC \(ti XGreat Britain BS 4730 GB \(Po \(Do \(at \(lB \(rs \(rB \(ha \(oq \(lC \(ba \(rC \(rn XJapan JIS C 6229 JP \(sh \(Do \(at \(lB \(Ye \(rB \(ha \(oq \(lC \(ba \(rC \(rn XChina GB 1988-80 CN \(sh \(Ye \(at \(lB \(rs \(rB \(ha \(oq \(lC \(ba \(rC \(rn XDenmark DS 2089 DK \(sh \(Do \(at \(AE \(/O \(oA \(ha \(oq \(ae \(/o \(oa \(ti XNorway NS 4551-1 NO \(sh \(Do \(at \(AE \(/O \(oA \(ha \(oq \(ae \(/o \(oa \(rn X.la NO2 X NS 4551-2 NO2 # $ @ [ \(/O ] ^ ` { | } ~ X.la FI XFinland FI # $ @ [ \e ] ^ ` { | } ~ XSweden SEN 850200 B SE \(sh \(Cs \(at \(:A \(:O \(oA \(:U \(oq \(:a \(:o \(oa \(rn X SEN 850200 C SE2 \(sh \(Cs \*('E \(:A \(:O \(oA \(:U \('e \(:a \(:o \(oa \(:u XGermany DIN 66 003 DE \(sh \(Do \(sc \(:A \(:O \(:U \(ha \(oq \(:a \(:o \(:u \(ss XHungary MSZ 7795/3 HU \(sh \(Cs \*('A \*('E \(:O \(:U \(ha \('a \('e \(:o \(:u \(a" X.\" Jugoslavia JUS I.B1.002 JS \(sh \(Do \*(vZ \*(vS \*(-D \*('C \*(vC \*(vz \*(vs \*(-d \*(vc \*('c X.la FR XFrance NF Z 62-010 FR \(Po \(Do \(`a \(de \(,c \(sc \(ha \(my \('e \(`u \(`e \(ad XItaly IT \(Po \(Do \(sc \(de \(,c \('e \(ha \(`u \(`a \(`o \(`e \(`i X.la ES XSpain ES # $ @ [ \e ] ^ ` { | } ~ X.la ES2 X ES2 # $ @ [ \e ] ^ ` { | } ~ XPortugal PT \(sh \(Do \(sc \*(~A \(,C \*(~O \(ha \(oq \(~a \(,c \(~o \(de X PT2 \(sh \(Do \(aa 
\*(~A \(,C \*(~O \(ha \(oq \(~a \(,c \(~o \(ti X.TE X.la US X.vs X.ps X.PP XTerminals in these countries are often Xadapted to these national variations. XCreating X.I troff Xinput for Danish texts on Danish terminals is a frustrating experience. XOne has to type X.B \e(AE Xfor X.B \(AE Xin spite of the presence of a special key for \(AE. XYou can by-pass this by using the X.I troff Xcommand \f3.tr\fP, Xwhich allows the mapping of any character to any other. XWe have employed the \f3.tr\fP command in a macro \f3.la\fP Xwhich is designed to make it possible to shift between Xall the ISO 646 input charater sets in the above table. XThe macro takes a code for the country as parameter; the first Xtwo letters being the ISO 3166 two-letter country code. XThe \f3.la\fP macro can be used like this: X X.ft CW X.nf X .la US X First we write something in "God's own" character set. X .la DK X S} skriver vi noget s|dt p} dansk: sodavandsis. X .la DE X Und f}r Deutschen k|nnen wir auch etwas schreiben! X .la US X.fi X X.ft Xgiving: X.br X.ft XFirst we write something in "God's own" character set. X.la DK XS} skriver vi noget s|dt p} dansk: sodavandsis. X.la DE XUnd f}r Deutschen k|nnen wir auch etwas schreiben! X.la US X.ft X XThe \f3.la\fP Xmacro can only be used when all the characters of the character sets in use Xhave a unique code on the printing device. Also you cannot change input Xcharacter set within a diversion in X.I troff, Xyou need to use the special character names if you want to use foreign Xcharacters within a sentence. An example of having a French name Xin a Danish text: X X.ft CW X.nf XJeg s} Jer\e(^ome Fran\e(,cois komme til K|benhavn. X.ft Xgiving: X.ft X.la DK XJeg s} Jer\(^ome Fran\(,cois komme til K|benhavn. X.ft X.fi X.la US X.PP XTo be able to use all other national characters within a national Xcharacter set, Xwe decided to introduce names for all the different national Xcharacters, Xeven for the 'default' US names. 
X.PP XAlso we went through the new ISO standards for Latin alphabets X(ISO 8859) Xand assured that all special characters there would have a unique Xname according to this proposal. X.PP XA last warning is about the X.I troff Xescape character \e and the national characters taking its place. XHere you must write the national character followed by an 'e' Xto get the desired result. X XThe \f3.la\fP macro has the follwing contents: X.DS X.ft CW X.ps 6 X.nr Sw \w' ' X.ta 8u*\n(Swu 16u*\n(Swu 24u*\n(Swu 32u*\n(Swu 40u*\n(Swu 48u*\n(Swu 56u*\n(Swu X.eo X.nf X.cc & X&so tmac.la X&cc . X.fi X.ec X.ps X.DE X.ft 1 X.NH 2 XProblems we did not pursue X.PP XWritten English is based on the latin alphabet, Xit hardly uses accents and other variations of letters. XOther languages make use of accents above (\(`e), below (\(,c) and Xthrough (\(/o) the letters. XWe do not address the problems of languages with more than one Xaccent per letter and X. \"(\o'v\(aa'\h'-\w'v'u'\v'0.1m',.\v'-0.1m'\h'0.2m'), Xaccents connecting letters. XIn our opinion these problems have to be solved by separate preprocessors. XThis allows a much more friendly user interface. XThese preprocessors could also solve the problem of hyphenation Xand ligatures for these languages. X.NH 2 XThe troff naming scheme X.PP X.I Troff Xhas three ways of naming characters: X.IP \(bu Xone character ASCII names like X.B A Xfor A, X.B B Xfor B Xand X.B @ Xfor @. X.IP \(bu Xescaped one character names prefixed by a X.B \e Xlike X.B \e\- Xfor current font minus, X.B \ee Xfor backslash Xand X.B \e\' Xfor acute accent. XThere are only a few of these. X.IP \(bu Xtwo character names prefixed by the indicator X.B \e( Xlike X.B \e(sc Xfor \(sc, X.B \e(*g Xfor \(*g Xand X.B \e(14 Xfor \(14. X.LP XThe sets of one-character names and escaped one character names are fixed. XOnly the set of two character names can be extended. 
X.NH XChoosing new names X.PP XWhile choosing names for new characters we were very much aware Xof the fact that the restriction of two characters per name Xdefies all attempts to choose a consistent and logical naming scheme. XStill we used a few principles in choosing the new names. XWhenever these principles conflicted we refrained from long Xdiscussions but placed more value on a quick decision. X.LP XOur principles: X.IP \(bu Xmay not conflict with original troff manual X.[ XOssanna X.] X.IP \(bu Xtry to avoid the national characters: X\(sh \(Do \(at \(lB \(rs \(rB \(ha \(oq \(lC \(ba \(rC \(ti X.IP \(bu Xuse characters associated with the graphical description of the symbol. XThus X.B \e(oA Xfor X.B \(oA Xinstead of \f3\e(AA\f1. X.IP \(bu XWhenever a character is a combination of an accent and a letter Xuse as name a character representing the accent followed by the letter. XFor example: \f3\e('e\f1 for \('e. XThis is the way these combinations were made on old-fashioned Xtypewriters: Xfirst hit the dead key with the accent then the key with the letter. XThe characters we used for the accents (and the like) are: X.br X.sp 0.6 X.TS Xallbox,center; Xl c c c c c c c c c c c c c. XASCII \(aa \(ga : , \(ti o \(ha . " u v - / XAccent \(aa \(ga \(ad \(ac \(ti \(de \(ha . \(a" \(ab \(ah \(hy \(sl XName aa ga ad ac ti de ha a. a" ab ah hy sl X.TE X.IP \(bu Xfractions are done in the natural way: \f3\e(\f2nm\f1 Xfor \s-3\v'-0.45m'n\v'0.45m'\s0/\s-3m\s0. X.IP \(bu XCurrency signs have two letter names. XThe first letter is Capital, the second small. X.IP \(bu XThe names for accents start with an \f3a\fP, for example X.B \e(ad Xfor \(ad. X.NH XThe weird characters one has X.LP XSuppose you have a X.I flat, Xwhich is unlikely on any other device, but Xstill want to have your input send to your friend/editor which can run Xit on his troff. XThe first thing to do is not using the character as it is Ximplemented on your troff (f.i. 
\e(ft), but define a Xstring at the beginning of the text and use that. XThe next thing to do is to give a description how it should look like. XSo if we stick with our flat, you end up at the start of the your Xinput file with: X.DS X.ft CW X .\e" We use a "flat" (\e(ft) a lot instead of a backslash, because it X .\e" stands out nicely and, since a backslash can be interpreted in X .\e" so many ways, I want to make clear that you see an escape when I X .\e" talk about it.... X .\e" So I define the string ft X .ds ft \e(ft X .\e" A "flat" is what in music stands for lowering the current X .\e" tone. If that not is clear, consider a | (pipe character) with X .\e" an small circle attached to it on the left bottom side. X .\e" A define in the style of X .\e" .ds \eo'|o' X .\e" will do if you don't have anything more than a lineprinter around. X .\e" If you want to be really fancy, you might want to try X .\e" something like: X .\e" .nr x \ew'o'/2u X .\e" .nr y .2m X .\e" .ds ft \ev'-\enyu'\ez|\eh'\enxu'\ev'\enyu'\eS'-9'o\eS'0' X .\e" Fancy ain't it? X.DE X.ft 1 XSo the rest of your article will look (at the input side) as: X.DS X.ft CW X An escape (denoted by \e*(ft) in troff will introduce a two X character name XX by \e*(ft( (so \e*(ft(XX) and a two character X named QQ string interpolation will be triggered X by \e*(ft*( (\e*(ft*(QQ). X.DE X.NH XThe future X.PP XNew versions of X.I troff Xinclude a more forms of naming characters. XBoth \f3\eC\'\f2arbitray\ long\ name\f3\'\f1 and X\f3\eN\'\f2absolute\ Number\ of\ character\ in\ current\ font\f3\'\f1 Xare used. XThe first allows character names of arbitrary length. XThe latter allows unnnamed characters. XThis does not solve the problems discussed in this article. XWorse, it will even be harder to choose names upon which Xa sizeable number of people will agree. X.NH XBibliography X.LP X.[ X$LIST$ X.] 
X.bp X.NH XThe character set X.LP X.so specchar + END-OF-FILE extchar chmod 'u=r,g=r,o=r' 'extchar' set `wc -c 'extchar'` count=$1 case $count in 12207) :;; *) echo 'Bad character count in ''extchar' >&2 echo 'Count should be 12207' >&2 esac echo Extracting 'spec' sed 's/^X//' > 'spec' << '+ END-OF-FILE ''spec' XWe divided the new characters in four categories. XEach character is mentioned only once. XWhenever we doubted we tried to place a character in the category we thought was Xmost suitable. Lastly we included the characters from Ossanna's Xtroff document for reference. X.NH 2 XSymbols from ISO 646 standards X.LP X.TS Xsh sharp XYe Yen XCs Currency sign XDo Dollar XPo English pound Xat at sign XlB left square bracket Xrs backslash XrB right square bracket Xha hat or accent circumflex XlC left curly bracket XrC right curly bracket Xba bar (possibly broken) Xti tilde Xa" accent double quote XAE AE X/O O slash XoA A circle Xae ae X/o o slash Xoa a circle X:A A diaeresis X:O O diaeresis X:U U diaeresis X'e e acute accent X:a a diaeresis X:o o diaeresis X:u u diaeresis Xss German ringel S X`a a grave accent X,c c cedilla X`e e grave accent X`u u grave accent X`o o grave accent X`i i grave accent Xr! reverse ! X~N N tilde Xr? reverse ? 
X~n n tilde X~a a tilde X^a a hat X,C C cedilla X^e e hat X^i i hat X~o o tilde X^u u hat X'E \*('E E acute accent X'A \*('A A acute accent X'a a acute accent X'i i acute accent X'c \*('c c acute accent X'C \*('C C acute accent X~O \*(~O O tilde X~A \*(~A A tilde X-d d bar X-D Capital Icelandic Eth (D) / D bar X.TE X.ne 10 X.NH 2 XSymbols from ISO 8859 standards X.LP X.TS XSd small Icelandic eth (d) X/l Polish l X/L Polish L XTp Small Icelandic Thorn XTP Capital Icelandic Thorn Xbb broken bar XS1 \h'0.9n'\v'-1n'\s-3\&1\s0\v'1n' superscript 1 XS2 \h'0.9n'\v'-1n'\s-3\&2\s0\v'1n' superscript 2 XS3 \h'0.9n'\v'-1n'\s-3\&3\s0\v'1n' superscript 3 X:e e diaeresis X`E E grave accent X'o o acute accent X^o o hat X'u u acute accent XOf feminin ordinator indicator XOm masculin ordinator indicator XFo double french open quote XFc double french close quote Xac accent cedilla Xad accent diaeresis Xps english paragraph sign X:i i diaeresis X12 one half X14 one quart X34 three quart Xmd centered dot Xno not X.TE X.ne 10 X.NH 2 XTypographical characters X.LP X.TS Xab accent breve Xao accent corona Xah accent ha\o'c\(ah'ek Xa. accent dot Xho hook X.i dotless i X.j dotless j Xfo french open quote Xfc french close quote XIJ IJ ligature IJ Xij ligature ij Xtm Trade Mark Xoq open quote Xoe French oe XOE French OE XOK check mark X.TE X.ne 20 X.NH 2 XMathematical characters X.LP X.TS X%0 per mille X-h h bar Xsd second sign Xc+ circle plus Xc* circle times X>~ approximately greater X<~ approximately less X<< much less X>> much greater X=~ approximately equal XOR logical or XAN logical and Xfa for all Xte there exists X3d therefore Xpp perpendicular to X/_ angle X!< not less X!> not greater Xnm not a member Xcn contains Xnc does not contain X~~ approximately XAh Aleph Xne not equivalent X-+ minus plus X.TE X.NH 2 XSymbols from the Troff manual by Ossanna. 
X.LP X.TS Xem 3/4 Em dash Xhy hyphen Xbu bullet Xsq square Xru rule X14 1/4 X12 1/2 X34 3/4 Xfi fi Xfl fl XFi ffi Xff ff XFl ffl Xde degree Xdg dagger Xfm foot mark Xct cent sign Xrg registered Xco copyright Xpl math plus Xmi math minus Xeq math equals X** math star Xsc section Xaa acute accent Xga grave accent Xul underline Xsl slash (matching backslash) X*a alpha X*b beta X*g gamma X*d delta X*e epsilon X*z zeta X*y eta X*h theta X*i iota X*k kappa X*l lambda X*m mu X*n nu X*c xi X*o omricron X*p pi X*r rho X*s sigma Xts terminal sigma X*t tau X*u upsilon X*f phi X*x chi X*q psi X*w omega X*A Alpha X*B Beta X*G Gamma X*D Delta X*E Epsilon X*Z Zeta X*Y Eta X*H Theta X*I Iota X*K Kappa X*L Lambda X*M Mu X*N Nu X*C Xi X*O Omricron X*P Pi X*R Rho X*S Sigma X*T Tau X*U Upsilon X*F Phi X*X Chi X*Q Psi X*W Omega Xsr square root Xrn root en extender X>= >= X<= <= X== identical equal X~= approx = Xap approximates X!= not equal X-> right arrow X<- left arrow Xua up arrow Xda down arrow Xmu multiply Xdi divide X+- plus-minus Xcu cup (union) Xca cap (intersection) Xsb subset of Xsp superset of Xib improper subset Xip improper superset Xif infinity Xpd partial deriative Xgr gradient Xnp not Xis integral sign Xpt proportional to Xes empty set Xmo member of Xbr box vertical rule Xdd double dagger Xrh right hand Xlh left hand Xbs Bell System logo Xor or Xci circle Xlt left top of big curly bracket Xlb left bottom Xrt right top Xrb right bottom Xlk left center Xrk right center Xbv bold verticel Xlf left floor Xrf right floor Xlc left ceiling Xrc right ceiling X.TE + END-OF-FILE spec chmod 'u=r,g=r,o=r' 'spec' set `wc -c 'spec'` count=$1 case $count in 4339) :;; *) echo 'Bad character count in ''spec' >&2 echo 'Count should be 4339' >&2 esac echo Extracting 'nicetr' sed 's/^X//' > 'nicetr' << '+ END-OF-FILE ''nicetr' Xawk 'BEGIN { X print ".tr '"'"'\\'"'"'`\\`" X print ".vs 12p" X print ".ps 10" X FS=" " X} X/^.TS/ { X print ".TS H" X print "lw(1c) lw(1c) lw(5c) lw(1c) lw(1c) 
lw(5c)." X print "\\f3Char\\fP\t\\f3Name\\fP\t\t\\f3Char\\fP\t\\f3Name\\fP" X print "" X print ".TH" X tabling=1 X no_entries=0 X} X/^.TE/ { X tabling=0 X if ( no_entries%2==1 ) printf "\n" X} Xtabling==1 && $1!=".TS" { X no_entries +=1 X if ( $2=="" ) { X printf "\\(%s\t\\e(%s\t%s", $1, $1, $3 X } else { X printf "%s\t\\e(%s\t%s", $2, $1, $3 X } X if ( no_entries%2==0 ) printf "\n" X else printf "\t" X} Xtabling==0 XEND { Xprint ".tr '"''"'``" Xprint ".ps 10" X} X' $* | tbl + END-OF-FILE nicetr chmod 'u=r,g=r,o=r' 'nicetr' set `wc -c 'nicetr'` count=$1 case $count in 613) :;; *) echo 'Bad character count in ''nicetr' >&2 echo 'Count should be 613' >&2 esac echo Extracting 'tmac.la' sed 's/^X//' > 'tmac.la' << '+ END-OF-FILE ''tmac.la' X.de la X.\" languages - keld@dkuug.dk & storm@dkuug.dk X.\" Covers all ECMA registrered versions of ISO 646 X.\" Countries according to ISO 3166 X.\" Commented is registered ECMA char code and standard number X.fl X.ie \\n(.$=0 .ds )L \\*(=L X.el .ds )L \\$1 X.ds =L \\*(LA X.ds LA \\*()L X.rm )L X.if "\\*(LA"DK" .tr #\(sh$\(Do@\(at[\(AE\\\\\(/O]\(oA^\(ha\`\(ga{\(ae|\(/o}\(oa~\(ti \" DS 2089 X.if "\\*(LA"US" .tr #\(sh$\(Do@\(at[\(lB\\\\\(rs]\(rB^\(ha\`\(ga{\(lC|\(ba}\(rC~\(ti \" B X3.4-1968 X.if "\\*(LA"ISO" .tr #\(sh$\(Cs@\(at[\(lB\\\\\(rs]\(rB^\(ha\`\(ga{\(lC|\(ba}\(rC~\(rn \" @ IRV X.if "\\*(LA"GB" .tr #\(Po$\(Do@\(at[\(lB\\\\\(rs]\(rB^\(ha\`\(ga{\(lC|\(ba}\(rC~\(rn \" A BS 4730 X.if "\\*(LA"DE" .tr #\(sh$\(Do@\(sc[\(:A\\\\\(:O]\(:U^\(ha\`\(ga{\(:a|\(:o}\(:u~\(ss \" K DIN 66 003 X.if "\\*(LA"FR" .tr #\(Po$\(Do@\(`a[\(de\\\\\(,c]\(sc^\(ha\`\(mu{\('e|\(`u}\(`e~\(ad \" f NF Z 62-010 (1982) X.if "\\*(LA"CN" .tr #\(sh$\(Ye@\(at[\(lB\\\\\(rs]\(rB^\(ha\`\(ga{\(lC|\(ba}\(rC~\(rn \" T GB 1988-80 X.if "\\*(LA"JP" .tr #\(sh$\(Do@\(at[\(lB\\\\\(Ye]\(rB^\(ha\`\(ga{\(lC|\(ba}\(rC~\(rn \" n JIS C 6229-1984 X.if "\\*(LA"IT" .tr #\(Po$\(Do@\(sc[\(de\\\\\(,c]\('e^\(ha\`\(`u{\(`a|\(`o}\(`e~\(`i \" Y X.if "\\*(LA"ES" .tr 
#\(Po$\(Do@\(sc[\(r!\\\\\(~N]\(r?^\(ha\`\(ga{\(de|\(~n}\(,c~\(ti \" Z X.if "\\*(LA"ES2" .tr #\(sh$\(Do@\(bu[\(r!\\\\\(~N]\(,C^\(r?\`\(ga{\(aa|\(~n}\(,c~\(ad \" h X.if "\\*(LA"PT" .tr #\(sh$\(Do@\(sc[\(~A\\\\\(,C]\(~O^\(ha\`\(ga{\(~a|\(,c}\(~o~\(de \" L X.if "\\*(LA"PT2" .tr #\(sh$\(Do@\(aa[\(~A\\\\\(,C]\(~O^\(ha\`\(ga{\(~a|\(,c}\(~o~\(ti \" g X.if "\\*(LA"HU" .tr #\(sh$\(Cs@\('A[\('E\\\\\(:O]\(:U^\(ha\`\('a{\('e|\(:o}\(:u~\(a" \" i MSZ 7795/3 X.if "\\*(LA"NO" .tr #\(sh$\(Do@\(at[\(AE\\\\\(/O]\(oA^\(ha\`\(ga{\(ae|\(/o}\(oa~\(rn \" ` NS 4551 - 1 X.if "\\*(LA"NO2" .tr #\(sc$\(Do@\(at[\(AE\\\\\(/O]\(oA^\(ha\`\(ga{\(ae|\(/o}\(oa~\(bv \" a NS 4551 - 2 X.if "\\*(LA"SE" .tr #\(sh$\(Cs@\(at[\(:A\\\\\(:O]\(oA^\(ha\`\(ga{\(:a|\(:o}\(oa~\(rn \" G SEN 850200 B X.if "\\*(LA"SE2" .tr #\(sh$\(Cs@\('E[\(:A\\\\\(:O]\(oA^\(:U\`\('e{\(:a|\(:o}\(oa~\(:u \" H SEN 850200 C X.if "\\*(LA"FI" .tr #\(sh$\(Do@\(at[\(:A\\\\\(:O]\(oA^\(ha\`\(ga{\(:a|\(:o}\(oa~\(ti X.if "\\*(LA"JS" .tr #\(sh$\(Do@\(vZ[\(vS\\\\\(-D]\('C^\(vC\`\(vz{\(vs|\(-d}\(vc~\('c \" z JUS I.B1.002 X.. 
+ END-OF-FILE tmac.la chmod 'u=r,g=r,o=r' 'tmac.la' set `wc -c 'tmac.la'` count=$1 case $count in 2287) :;; *) echo 'Bad character count in ''tmac.la' >&2 echo 'Count should be 2287' >&2 esac echo Extracting 'pchdefs' sed 's/^X//' > 'pchdefs' << '+ END-OF-FILE ''pchdefs' X.\" The following characters are not defined on the Agfa P400 printer X.ds 'E E\h'-\w"E"u-\w"\(aa"u/2u'\v'-.19m'\(aa\v'.19m' X.ds 'A A\h'-\w"A"u-\w"\(aa"u/2u'\v'-.19m'\(aa\v'.19m' X.ds ~A A\h'-\w"A"u-\w"\(ti"u/2u'\v'-.19m'\(ti\v'.19m' X.ds ~O O\h'-\w"O"u-\w"\(ti"u/2u'\v'-.19m'\(ti\v'.19m' X.ds 'C C\h'-\w"C"u-\w"\(aa"u/2u'\v'-.19m'\(aa\v'.19m' X.ds 'c c\h'-\w"c"u-\w"\(aa"u/2u'\v'-.19m'\(aa\v'.19m' X.\" .ds vS S\h'-\w"S"u-\w"\(ah"u/2u'\v'-.19m'\(ah\v'.19m' X.\" .ds vs s\h'-\w"s"u-\w"\(ah"u/2u'\v'-.19m'\(ah\v'.19m' X.\" .ds vc c\h'-\w"c"u-\w"\(ah"u/2u'\v'-.19m'\(ah\v'.19m' X.\" .ds vC C\h'-\w"C"u-\w"\(ah"u/2u'\v'-.19m'\(ah\v'.19m' X.\" .ds vZ Z\h'-\w"Z"u-\w"\(ah"u/2u'\v'-.19m'\(ah\v'.19m' X.\" .ds vz z\h'-\w"z"u-\w"\(ah"u/2u'\v'-.19m'\(ah\v'.19m' X.\" .ds -D D\h'-\w"D"u-\w"\(hy"u/2u'\v'-.19m'\(hy\v'.19m' X.\" .ds -d d\h'-\w"d"u-\w"\(hy"u/2u'\v'-.19m'\(hy\v'.19m' + END-OF-FILE pchdefs chmod 'u=r,g=r,o=r' 'pchdefs' set `wc -c 'pchdefs'` count=$1 case $count in 858) :;; *) echo 'Bad character count in ''pchdefs' >&2 echo 'Count should be 858' >&2 esac echo Extracting 'typesetting' sed 's/^X//' > 'typesetting' << '+ END-OF-FILE ''typesetting' X%T Adventures with Typesetter\-Independent TROFF X%A Mark Kahrs X%A Lee Moore X%R TR 159 X%C Rochester, NY X%D June, 1985 X%I University of Rochester X X%A J. F. Ossanna X%T NROFF/TROFF User's Manual X%D October 1976 X%R Comp. Sci. Tech. Rep. 54 X%I Bell Laboratories X%C Murray Hill, NJ X X%A B.W. Kernighan X%T A typesetter-independent TROFF X%D March 1982 X%R Comp. Sci. Tech. Rep. 97 X%I Bell Laboratories X%C Murray Hill, NJ X X%A J. 
Akkerhuis X%T Unknown X%J Proceedings of the European Unix\(tm System User Group Autumn Meeting X%D 7-9 september 1981 + END-OF-FILE typesetting chmod 'u=rw,g=r,o=r' 'typesetting' set `wc -c 'typesetting'` count=$1 case $count in 533) :;; *) echo 'Bad character count in ''typesetting' >&2 echo 'Count should be 533' >&2 esac exit 0
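The `.tr` lines in tmac.la above map the national positions of ISO 646 onto troff character names. The same idea can be sketched outside troff. The tables below are only a partial reading of three of those `.tr` lines (letters only, punctuation differences left out), and the names `NATIONAL_VARIANTS` and `from_national_ascii` are invented for this illustration, not part of the archive.

```python
# Partial ISO 646 national-variant tables, read off the DK, SE and DE
# .tr lines in tmac.la. Only the letter positions are included; this is
# an illustrative subset, not a complete rendering of the macro.
NATIONAL_VARIANTS = {
    "DK": {"[": "Æ", "\\": "Ø", "]": "Å", "{": "æ", "|": "ø", "}": "å"},
    "SE": {"[": "Ä", "\\": "Ö", "]": "Å", "{": "ä", "|": "ö", "}": "å"},
    "DE": {"[": "Ä", "\\": "Ö", "]": "Ü", "{": "ä", "|": "ö", "}": "ü", "~": "ß"},
}


def from_national_ascii(text: str, variant: str) -> str:
    """Replace the national code positions in 7-bit text with the letters they stand for."""
    table = str.maketrans(NATIONAL_VARIANTS[variant])
    return text.translate(table)


if __name__ == "__main__":
    print(from_national_ascii("J|rn", "DK"))
    print(from_national_ascii("B}stad", "SE"))
```

The catch, as with the macro, is that the reader has to know which national variant the writer used; the bytes themselves do not say.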
hansen@pegasus.att.com (Tony L. Hansen) (12/27/90)
< From: heimir@rhi.hi.is (Heimir Thor Sverrisson) << There are some methods of data interchange, such as most email systems, << that are inherently 7-bit. It would be nice if we could just banish << them, but compatibility is an albatross. < There is nothing that tells me that email systems should be _inherently < 7-bit_. In fact here in Iceland we have to hack almost every piece of < communications software to be able to use it in our _inherently 8-bit < environment_. This < < I can see no reason at all for some stupid mailers to strip off the < eighth bit (including Interactive's version of sendmail). They don't < have to - and should not - interpret the contents of the mail they are < transmitting. This is quite different from the troff situation where a < program has to know a lot about it's input set. So why don't you guys < simply open up your mailers and be ready with a 8-bit clean version by < the end of 1991! 1991? Why not now? The System V release 4.0 mail program is completely 8-bit clean! (If you can find anyplace within SVr4 mail which isn't, I'll personally guarantee that the next version of mail which comes from UNIX System Laboratories will have a fix for the problem.) By the way, the SMTP protocol doesn't permit 8-bit data. This limits mailers which must send mail using that protocol. Tony Hansen att!pegasus!hansen, attmail!tony hansen@pegasus.att.com
keld@login.dkuug.dk (Keld J|rn Simonsen) (12/28/90)
hansen@pegasus.att.com (Tony L. Hansen) writes:

>< From: heimir@rhi.hi.is (Heimir Thor Sverrisson)
>< I can see no reason at all for some stupid mailers to strip off the
>< eighth bit (including Interactive's version of sendmail). They don't
>< have to - and should not - interpret the contents of the mail they are
>< transmitting. This is quite different from the troff situation where a
>< program has to know a lot about it's input set. So why don't you guys
>< simply open up your mailers and be ready with a 8-bit clean version by
>< the end of 1991!

>1991? Why not now? The System V release 4.0 mail program is completely 8-bit
>clean! (If you can find anyplace within SVr4 mail which isn't, I'll
>personally guarantee that the next version of mail which comes from UNIX
>System Laboratories will have a fix for the problem.)

I have also done patches to sendmail 5.61 and 5.64 to do 8-bit mail. The patches require that you also have IDA installed. The package handles a fair number of different 8-bit character sets, such as 8859-1, 8859-2 and the rest of the 8859 family, as well as vendor character sets like the IBM codepages. In all, about 60 character sets are supported in the current release.

>By the way, the SMTP protocol doesn't permit 8-bit data. This limits mailers
>which must send mail using that protocol.

My package also has provisions for sending 8-bit mail through 7-bit SMTP in an "encoded ASCII" mode.

My package is available in dkuug.dk:pub/sm.8+bit.pa sm5.64.8+bit.pa and ch.shar by anon ftp. By mail you can get it by mailing mail uunet!dkuug.dk!archive Subject: files pub Names: sm.8+bit.pa sm5.64.8+bit.pa ch.shar

It's about 100 kb.

Enjoy!
Keld Simonsen
tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill) (12/28/90)
hansen@pegasus.att.com (Tony L. Hansen) writes: > > By the way, the SMTP protocol doesn't permit 8-bit data. This limits > mailers which must send mail using that protocol. True. But there is no technical reason (other than short-sightedness) why SMTP has to strip off the 8th (high) bit. There are in fact working versions of sendmail that don't disturb the 8th bit.
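What "stripping the 8th bit" does to a message can be shown in a few lines. This is only a sketch of the effect, not of any particular mailer's code; the function name `strip_high_bit` is made up here, and the sample assumes the text is encoded in ISO 8859-1.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Clear the top bit of every byte, as a 7-bit-only program would."""
    return bytes(b & 0x7F for b in data)


def survives_unchanged(data: bytes) -> bool:
    """True if the data passes through a 7-bit channel without damage."""
    return strip_high_bit(data) == data


if __name__ == "__main__":
    original = "Båstad".encode("latin-1")   # a-ring is byte 0xE5 in ISO 8859-1
    print(strip_high_bit(original))         # 0xE5 & 0x7F == 0x65, the letter 'e'
    print(survives_unchanged(b"plain ASCII"))
```

Pure ASCII text is unaffected, which is why the damage went unnoticed for so long on English-only systems.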
bruce@balilly.UUCP (Bruce Lilly) (12/28/90)
In article <1990Dec27.043500.27639@cbnewsk.att.com> hansen@pegasus.att.com (Tony L. Hansen) writes: >< From: heimir@rhi.hi.is (Heimir Thor Sverrisson) > ><< There are some methods of data interchange, such as most email systems, ><< that are inherently 7-bit. It would be nice if we could just banish ><< them, but compatibility is an albatross. > >< So why don't you guys >< simply open up your mailers and be ready with a 8-bit clean version by >< the end of 1991! > >1991? Why not now? The System V release 4.0 mail program is completely 8-bit >clean! (If you can find anyplace within SVr4 mail which isn't, I'll >personally guarantee that the next version of mail which comes from UNIX >System Laboratories will have a fix for the problem.) Is AT&T SVR4 available for the AT&T 3B1 and/or 7300 UNIX(TM)PC? ^^^^ ^^^^ No? You've got to be kidding. Reality check: it is impossible for everybody everywhere to upgrade (if that's the right term (it might be)) to version N of any software. Even if economic limitations are ignored, other considerations, such as inertia, inadequate disk space, compatibility requirements with other software and (ahem :-) lack of vendor support prevent many people from upgrading. Of course, if you can make a port of SVR4 available at a reasonable price for the 3B1, I'll upgrade. (i.e. put up or shut up) Don't take this too seriously -- I'm really a big fan of AT&T. I just get somewhat aggravated when I'm told (by AT&T) that I can't purchase (at any price) AT&T software to run on my AT&T hardware (like the time I asked if WWB was available for the 3B1). > Tony Hansen > att!pegasus!hansen, attmail!tony > hansen@pegasus.att.com -- Bruce Lilly blilly!balilly!bruce@sonyd1.Broadcast.Sony.COM
hansen@pegasus.att.com (Tony L. Hansen) (12/31/90)
<< From: hansen@pegasus.att.com (Tony L. Hansen) << By the way, the SMTP protocol doesn't permit 8-bit data. This limits << mailers which must send mail using that protocol. < From: tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill) < True. But there is no technical reason (other than short-sightedness) < why SMTP has to strip off the 8th (high) bit. There are in fact < working versions of sendmail that don't disturb the 8th bit. I agree completely, there is no reason to limit SMTP to 7-bits. Unfortunately, the standard currently REQUIRES the stripping and doing anything else is non-standard. I would definitely support changing the standard to allow an arbitrary 8-bit byte stream. This would also require eliminating the limitation of 1024-byte lines and anything else in the standard which is not content transparent. System V release 4 mail is completely content transparent. As long as the transport media is capable of handling the mail, SVr4 mail will be able to get it to you unchanged. Unfortunately, it can't do so over SMTP connections. Since this discussion is going somewhat away from the bounds of comp.text, I've added comp.mail.misc to the Newsgroup list. Tony Hansen att!pegasus!hansen, attmail!tony hansen@pegasus.att.com
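The two restrictions mentioned above (7-bit data and a line-length limit) can be checked mechanically before a message is handed to an SMTP transport. The sketch below is an assumption-laden illustration: the function name is invented, and the default limit of 1024 bytes is simply the figure quoted in the post; the exact value should be taken from the SMTP specification itself.

```python
def smtp_7bit_problems(message: bytes, max_line: int = 1024) -> list[str]:
    """List the reasons a message is not safe for a 7-bit, line-limited transport."""
    problems = []
    for number, line in enumerate(message.split(b"\n"), start=1):
        line = line.rstrip(b"\r")            # tolerate CRLF as well as LF
        if any(b > 0x7F for b in line):
            problems.append(f"line {number}: byte with the 8th bit set")
        if len(line) > max_line:
            problems.append(f"line {number}: longer than {max_line} bytes")
    return problems


if __name__ == "__main__":
    for problem in smtp_7bit_problems(b"Hello\nB\xe5stad\n"):
        print(problem)
```

An empty result means the message can travel as is; anything else would need encoding, or a transport that is content transparent.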
barmar@think.com (Barry Margolin) (12/31/90)
In article <1990Dec31.004055.10335@cbnewsk.att.com> hansen@pegasus.att.com (Tony L. Hansen) writes: >System V release 4 mail is completely content transparent. As long as the >transport media is capable of handling the mail, SVr4 mail will be able to >get it to you unchanged. What does it do when sending textual mail to a system that doesn't use ASCII encoding, e.g. an IBM mainframe, or to a system with a different newline convention (e.g. CRLF rather than LF)? SMTP places restrictions on the characters that may appear in a message to support automated translation during the transfer process. -- Barry Margolin, Thinking Machines Corp. barmar@think.com {uunet,harvard}!think!barmar
keld@login.dkuug.dk (Keld J|rn Simonsen) (01/02/91)
hansen@pegasus.att.com (Tony L. Hansen) writes:

>< From: tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill)
>< True. But there is no technical reason (other than short-sightedness)
>< why SMTP has to strip off the 8th (high) bit. There are in fact
>< working versions of sendmail that don't disturb the 8th bit.

This introduces a problem with "embedded slashes", which are now represented internally in Sendmail with the 8th bit set. Has anybody got Sendmail patches to remedy this?

>I agree completely, there is no reason to limit SMTP to 7-bits.
>Unfortunately, the standard currently REQUIRES the stripping and doing
>anything else is non-standard. I would definitely support changing the
>standard to allow an arbitrary 8-bit byte stream. This would also require
>eliminating the limitation of 1024-byte lines and anything else in the
>standard which is not content transparent.

I am much in favour of extending the character set supported by SMTP. But you should be careful. What is the meaning of an 8-bit character? Well, that depends on the character set employed. Today we know that only 7-bit ASCII is allowed. But with 8-bit mail, is this octal code 0162 coming over the line a "small a with acute accent" (as in ISO 8859-1:1987), a Cent sign (as in IBM CP 437) or a "capital A with circumflex" (as in HP Roman8)? This might become a real problem given the current shares on the UNIX market. Just displaying the 8-bit data to a user may be very confusing. It may even do strange things to your terminal equipment if an IBM codepage character set is employed, as some of its characters sit in positions that ISO 8859 and other vendors' character sets reserve for the upper control characters.

Should one then just say "Use ISO 8859"? Well, what ISO 8859? There are several parts, latin 1, latin 2 (eastern Europe), Greek, Cyrillic, Arabic, Hebrew (among others). The abovementioned character 0162 has different meanings in these different character sets.
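The ambiguity is easy to see by decoding one and the same byte under several character sets. The sketch below uses the codec tables that ship with Python and assumes the byte in question is decimal 162 (hex A2); both are assumptions, and the glyphs those tables report need not match the assignments listed above one for one. The helper name `interpretations` is invented for the illustration.

```python
def interpretations(byte_value: int, codecs: tuple[str, ...]) -> dict[str, str]:
    """Decode a single byte under each named codec and collect the results."""
    raw = bytes([byte_value])
    return {codec: raw.decode(codec, errors="replace") for codec in codecs}


if __name__ == "__main__":
    # Codec names are Python's own: ISO 8859 parts 1, 2, 5 and 7,
    # IBM code page 437 and HP Roman8.
    candidates = ("latin-1", "iso8859_2", "iso8859_5", "iso8859_7",
                  "cp437", "hp_roman8")
    for codec, glyph in interpretations(162, candidates).items():
        print(f"{codec:10} -> {glyph!r}")
```

Without a label travelling with the message, the receiver has no way to pick the right row.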
ISO 8859-1 would be the natural choice (and is also specified in a recent RFC on encoding: header). But is that fair? I think that is like inventing a new ASCII, only capable of serving one region of the world sufficiently - this time having Western Europe (EEC) and all of North and South America covered. We should do something that could cover the whole world.

It is also quite hard to persuade your manufacturers to change their implementation character set, and even worse for equipment you have already bought and installed. Some of this may even be running software with no 8-bit capabilities!

I think it would be nice to be able to support all of these systems, new and old, and I have done an implementation of Sendmail capable of supporting more than 60 character sets. It currently does not touch the headers, but only the mail body. For a character not in the current character set, it encodes the character with a mnemonic code, for example a' for the above mentioned "small a with acute". Thus even in ASCII you can get the message!

The sendmail patches are available with anon ftp in dkuug.dk:pub/ch.shar and sm5.64.8+bit.pa (sm.8+bit.pa for 5.61). It's about 100 kb - the Sendmail patches themselves are under 100 lines, the rest is the character set stuff. It has been running here at dkuug.dk since Feb 90.

A new ISO standard is showing up: ISO 10646 (which has just been published as a Draft International Standard (DIS)). This covers all characters in the world, with very few exceptions. And the exceptions are planned to be included in a later issue. Actually Dan Oscarsson and I have been planning (mostly Dan) to do an SMTP implementation for Sendmail negotiating 10646 for transmission, and to write an RFC for this character set negotiation.

Keld Simonsen
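The mnemonic fallback can be sketched roughly as follows. Only the a' example comes from the post; the rest of the table, the "?" placeholder and the function name are invented for illustration, and the real package presumably uses a much larger table plus some escaping convention so that the encoding can be reversed.

```python
# Illustrative mnemonic table. Only the first entry is taken from the
# post; the others are guesses in the same spirit.
MNEMONICS = {"á": "a'", "é": "e'", "å": "aa", "ø": "o/", "æ": "ae"}


def to_charset(text: str, charset: str = "ascii") -> str:
    """Keep characters the target set can represent; spell out the others."""
    pieces = []
    for ch in text:
        try:
            ch.encode(charset)               # representable: pass through
            pieces.append(ch)
        except UnicodeEncodeError:
            pieces.append(MNEMONICS.get(ch, "?"))
    return "".join(pieces)


if __name__ == "__main__":
    print(to_charset("Bogotá"))              # target is plain ASCII
    print(to_charset("Bogotá", "latin-1"))   # target can hold the letter
```

The text stays readable on a 7-bit terminal, which is the point being made: a reader with only ASCII still gets the message.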
jacob@gore.com (Jacob Gore) (01/02/91)
/ comp.text / keld@login.dkuug.dk (Keld J|rn Simonsen) / Jan 1, 1991 /
> Should one then just say "Use ISO 8859"? Well, what ISO 8859?
> There are several parts, latin 1, latin 2 (eastern Europe),
> Greek, Cyrillic, Arabic, Hebrew (among others). The abovementioned
> character 0162 has different meanings in these different character sets.

Specify the character set in the header. For example,

    Char-Encoding: ISO-8859-Latin-1

Jacob
--
Jacob Gore Jacob@Gore.Com boulder!gore!jacob
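A receiver could act on such a header along these lines. This is a hypothetical sketch: the header name follows the suggestion above, the label-to-codec table and the function name are invented, and unknown or missing labels simply fall back to ASCII with replacement characters.

```python
# Hypothetical mapping from header labels to Python codec names.
CHARSET_LABELS = {
    "ISO-8859-Latin-1": "iso8859_1",
    "ISO-8859-Latin-2": "iso8859_2",
}


def decode_message(raw: bytes, default: str = "ascii") -> str:
    """Decode the body according to a Char-Encoding header, if one is present."""
    head, _, body = raw.partition(b"\n\n")   # headers end at the first blank line
    codec = default
    for line in head.split(b"\n"):
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"char-encoding":
            label = value.strip().decode("ascii", errors="replace")
            codec = CHARSET_LABELS.get(label, default)
    return body.decode(codec, errors="replace")


if __name__ == "__main__":
    raw = b"From: jacob\nChar-Encoding: ISO-8859-Latin-1\n\nB\xe5stad\n"
    print(decode_message(raw))
```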
les@chinet.chi.il.us (Leslie Mikesell) (01/02/91)
In article <1990Dec31.013538.9473@Think.COM> barmar@think.com (Barry Margolin) writes: >In article <1990Dec31.004055.10335@cbnewsk.att.com> hansen@pegasus.att.com (Tony L. Hansen) writes: >>System V release 4 mail is completely content transparent. As long as the >>transport media is capable of handling the mail, SVr4 mail will be able to >>get it to you unchanged. >What does it do when sending textual mail to a system that doesn't use >ASCII encoding, e.g. an IBM mainframe, or to a system with a different >newline convention (e.g. CRLF rather than LF)? SMTP places restrictions on >the characters that may appear in a message to support automated >translation during the transfer process. But the automated translation can currently only work with text while many mailers are now capable of attaching arbitrary binary data to messages. Depending on the type of the content, a different transformation (or none) may be desired. Assuming that the non-textual portions are encapsulated with "Content-Type:" and "Content-Length:" headers, it would be easy for the transport to determine what, if any, transformation to use. In addition, an optional "Encoding-Method:" header can allow temporary transformations to meet the character set requirements of the transports. If the sending program had a way to determine the capabilities of the recipient, encoding could be done on-the-fly, using uuencode or atob, and thus only done where necessary (but I don't know of anyone actually doing this yet...). These issues are going to have to be addressed for messages originating on X.400 systems anyway, so why not try to do it efficiently by adding the equivalent functionality to SMTP/uucp mailers? Les Mikesell les@chinet.chi.il.us
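Encoding "only where necessary" might look something like the sketch below. It assumes the sender somehow knows whether the transport is 8-bit clean (the post notes nobody is known to do this yet), uses uuencode-style lines via Python's `binascii` module, and borrows the header names from the paragraph above; the value "uuencode" and the function name are made up for the example.

```python
import binascii


def prepare_part(payload: bytes, transport_is_8bit_clean: bool):
    """Return (headers, data), uuencoding only when the transport requires it."""
    needs_encoding = any(b > 0x7F for b in payload)
    if transport_is_8bit_clean or not needs_encoding:
        return {"Content-Length": str(len(payload))}, payload
    # b2a_uu accepts at most 45 bytes per call and returns one encoded line.
    lines = [binascii.b2a_uu(payload[i:i + 45])
             for i in range(0, len(payload), 45)]
    encoded = b"".join(lines)
    headers = {"Encoding-Method": "uuencode",
               "Content-Length": str(len(encoded))}
    return headers, encoded


if __name__ == "__main__":
    headers, data = prepare_part("Båstad".encode("latin-1"), False)
    print(headers)
    print(data)
```

The receiving side would reverse the step only when it sees the encoding header, so 7-bit text and clean transports pay no cost.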
tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill) (01/03/91)
keld@login.dkuug.dk (Keld J|rn Simonsen) writes:
>
> Should one then just say "Use ISO 8859"? Well, what ISO 8859?
> There are several parts, latin 1, latin 2 (eastern Europe),
> Greek, Cyrillic, Arabic, Hebrew (among others)...
> We should do something that could cover the whole world.

This is what Unicode is for. Unicode should be considered the most useful and implementable subset of the draft standard ISO 10646.

Unicode is an unambiguous fixed-length 16-bit global codeset currently under development by the Unicode Consortium. Unicode offers a uniform text and character standard that can encompass all living languages and form a long-lasting basis for worldwide data exchange.

Unicode makes all 65,535 slots available, with these constraints:

o The first 256 slots duplicate the arrangement of ASCII and ISO Latin-1.

o Characters unique to a language are grouped together in standard order.

o Letters, punctuation, symbols, and diacritics shared by multiple
  languages are grouped together.

o Asian pictographs are grouped together in order of frequency (as
  specified by national standards), then sorted in traditional
  radical/stroke order.

o Chinese, Japanese, and Korean phonetic symbols are grouped together
  by language in standard order.

The reason 16 bits are enough is that Asian pictographs which everyone would recognize as the same have been unified. Thus, more than 31,000 characters have been reduced to about 20,000 slots.

    Major Han Character Standards

    Country   Standard    Year   Characters
    -------   --------    ----   ----------
    China     GB 2312     1980        6,763
    Japan     JIS X0208   1983        6,349
    Korea     KS C5601    1987        4,888
    Taiwan    CNS 11643   1986       13,051
                                     ------
                              total  31,051

In addition to East Asian languages, here are the writing systems currently available in Unicode: Greek, Cyrillic, Georgian, Armenian, Hebrew, Arabic, Ethiopian, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhalese, Thai, Lao, Burmese, Khmer, Tibetan, and Mongolian.
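Two of the properties claimed above (fixed 16-bit width, and the first 256 slots matching Latin-1) can be tried out with a present-day Unicode implementation. This sketch relies on the tables built into Python, which postdate this discussion and may differ from the draft being described; the function name is invented, and text containing characters beyond the 16-bit range is rejected rather than encoded.

```python
def fixed_width_16(text: str) -> bytes:
    """Encode text as big-endian 16-bit units, exactly two bytes per character."""
    data = text.encode("utf-16-be")
    if len(data) != 2 * len(text):
        raise ValueError("text contains characters outside the 16-bit range")
    return data


def first_256_match_latin1() -> bool:
    """Check that code points 0..255 line up with ISO 8859-1 byte values."""
    return all(bytes([value]).decode("latin-1") == chr(value)
               for value in range(256))


if __name__ == "__main__":
    print(fixed_width_16("Båstad").hex(" ", 2))
    print(first_256_match_latin1())
```

Fixed width is what makes indexing trivial: character n always starts at byte 2n.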
keld@login.dkuug.dk (Keld J|rn Simonsen) (01/03/91)
tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill) writes: >keld@login.dkuug.dk (Keld J|rn Simonsen) writes: >> >> Should one then just say "Use ISO 8859"? Well, what ISO 8859? >> There are several parts, latin 1, latin 2 (eastern Europe), >> Greek, Cyrillic, Arabic, Hebrew (among others)... >> We should do something that could cover the whole world. >This is what Unicode is for. Unicode should be considered the most >useful and implementable subset of the draft standard ISO 10646. Is UNICODE a true subset of ISO 10646? Is there a well defined relation between ISO 10646 encoding and UNICODE? Seasons greetings! Keld
tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill) (01/04/91)
keld@login.dkuug.dk (Keld J|rn Simonsen) writes:
>
> Is UNICODE a true subset of ISO 10646?
> Is there a well defined relation between ISO 10646 encoding and UNICODE?

ISO 10646 is still in draft form. Both questions are impossible to answer until 10646 gets finalized.

Disclaimer: I'm not an expert in this area. However, extrapolating from what I know, it appears that Unicode could be considered a 16-bit implementation of 10646. The ISO 10646 draft standard appears to permit 16-bit implementations of any subset thereof, for use in process code or communication.

It just so happens that Unicode covers all Asian characters enumerated by existing national standards, plus characters from languages that the 10646 draft hasn't even thought about. So it may be a subset, but a largely complete subset.

Lee Collins writes:
> Notice that 10646 would require 93,816 separate codes to cover existing
> [Chinese/Japanese/Korean] standards. Han Unification allows Unicode to
> cover the same standards with only 18,739 unique characters.

Ken Whistler writes:
> Unicode 1.0 also includes the following scripts omitted from DIS 10646:
> Ethiopian, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada,
> Malayalam, Sinhalese, and Lao.

There have been attempts to convert Unicode to 10646 and back again, I believe with mostly good results. Of course, some data may be lost in the translation.