storm@texas.dk (Kim F. Storm) (07/24/90)
(I've cross-posted this to news.software.nn since it contains some information about future directions of nn. Followups are directed to .b only. ++Kim)

In news.software.b, frisk@rhi.hi.is (Fridrik Skulason) writes:

>Well - some of us have 8-bits news already - I am for example using an 8-bit
>'rn' right now.

nn release 6.4 supports presentation of 8-bit data (and with pl 6 it also accepts 8-bit command input).

>I fully agree that we need an 8-bit news system (as well as 8-bit E-mail),
>as this would make life a lot easier for those of us not using English.

Certainly depends on who "we" are. I think that if we are going to attack this problem, we had better do it properly from the start, and not *just* solve the 8-bit news problem (which you cannot solve anyway).

>Modifying the news software to permit the transmission of 8-bit data is
>trivial - the real problem is the character set issue.

Changing any software is just *so easy*. But it is *impossible* to get people to install the changes unless they have a personal interest in doing so. I speak from experience: more than one year has passed since the initial worldwide release of nn (rel. 6.3.0). Since then, about 20 patches including a new release have been posted, but there are still some sites out there running 6.3.0 (or .1 or .2), which you can recognize from the RFC-violating Re^2: prefixes in the Subject: lines on some postings. I still get complaints about how stupid nn is, although this problem is fixed (oh yes, it was *trivial*). But getting people to update....

So if you want this to work on a world-wide scale within a timeframe of less than 3-6 years, I believe it must be done in the news reader software at the end-points which need it, and we must design a transport protocol for 8-bit, 16-bit and even 32-bit data which can:

a) be transparently sent through the current 7-bit restrictive channels, and
b) still be interpreted sensibly on systems and by news readers which are not adapted to this new scheme.

Keld Simonsen from the Danish UNIX-system Users Group (DKUUG) has defined a new naming scheme for international characters based on 10646, using primarily two-character names which attempt to be as close to the real character as possible. For example, e' is an e with ' above it, o: is an o with two dots above, etc.

Now this sounds rather trivial, but it has specifically been designed for the above purposes: (a) it uses only a subset of the ASCII character set (e.g. { and } are not used since they are used for national characters in many older 7-bit character sets based on ISO 646), and (b) as the example shows, the character name is a close approximation to the actual character. So you can actually read a letter written using the character names! (Of course, "a" is named "a", "b" is "b", "A" is "A", etc.)

A letter can then be written in any of the 8859/x variants, in various EBCDIC and other IBM codepages, etc., using the N-bit codes supported on the local system. However, when such a letter is transmitted to a remote system, all the international characters are *encoded* by replacing each of them with an "escape character" followed by the (two-character) character name. The result is a pure 7-bit letter.

At the receiving end, the encoded letter can be converted back to the originating character set, or to the character set used on the local system, as far as that is possible. But that is the choice of the recipient end! Or it can just be read without conversion, since the encoding is "readable" (the only problem being the escape character).

In the sendmail used on the Danish DKnet backbone, Keld has implemented this and it is running very well, supporting about 50 different character sets. By default it uses ^] as the escape character, which has the benefit of being invisible on most terminals, but it can use any escape character you like. Both the escape character and the originating character set are specified in the article's header.
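
As a rough illustration of the escape-plus-name encoding described above, here is a minimal Python sketch. Only the names e' and o: and the default ^] escape come from this posting; the third table entry, the function names and the handling of unknown names are assumptions made for the example, not Keld's actual implementation.

```python
ESC = "\x1d"  # ^] (control-]), the default escape character mentioned above

# Hypothetical excerpt of the name table. Only e' and o: appear in the
# posting; a: is assumed here by analogy with o:.
NAMES = {"é": "e'", "ö": "o:", "ä": "a:"}
CHARS = {name: ch for ch, name in NAMES.items()}


def encode(text, esc=ESC):
    """Replace each named international character by esc + two-character name."""
    out = []
    for ch in text:
        if ch in NAMES:
            out.append(esc + NAMES[ch])
        elif ord(ch) < 128:
            out.append(ch)  # plain ASCII is its own name
        else:
            raise ValueError("no name defined for %r in this sketch" % ch)
    return "".join(out)


def decode(text, esc=ESC):
    """Turn esc + name back into the character; leave unknown names untouched."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == esc and text[i + 1:i + 3] in CHARS:
            out.append(CHARS[text[i + 1:i + 3]])
            i += 3
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

A recipient that does not decode simply sees the two-character names (plus the escape character, if its terminal shows it), which is the "readable" property claimed above.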
(more on this below)

>Some possible solutions:
>(1) Each machine posts articles using the user's character set of choice.
>To indicate which character set is used, a new field is added to the header.
> examples: Character-set: CP 870
>           Character-set: ISO 8859/4

This is what Keld's sendmail extensions support today.

>This is easy to implement, but has one serious drawback - all machines are
>required to be able to handle all possible character sets.

Not with Keld's solution:

- If you know the character set, you can convert to it.
- If you use another character set, you can convert to that instead, since all international characters have been given *unique* names.
- If your software doesn't understand any of it, you can still read the message with little or no problems.

And in the sendmail case, the Danish backbone is actually doing the encoding *and* decoding for the Danish sites for which it has been told which character set they prefer. So if one site runs 8859/1, they send 8-bit 8859/1 data directly to the backbone, and if the recipient is a known EBCDIC site, the backbone converts the letter to EBCDIC before delivery! If it is to an unknown site, it will be converted to the "encoded" character set, and it is thus the task of the recipient to handle it.

So in Denmark we not only run 8-bit mail, but *multi character set* mail. And it is transparent for all practical uses.

>(2) On every machine the article is translated into one of the ISO 8859/x
>character sets....

Too limited, and which one should you choose?

>(3) All text is transmitted according to the ISO 10646 standard. This has
>one advantage compared to (2) - it allows the transmission of documents
>containing 16-bit characters, as well as documents containing characters from
>more than one of the 8859/x standards. For example, one could send a message
>with the first part in Russian and the second part in Greek.

Currently, I think Keld has defined about 1000 characters *including* Greek, Russian (Cyrillic), Hebrew, Arabic, all 8859 sets, EBCDIC, PC character sets and more. And there are "hooks" reserved to include longer names for kanji characters and the like. So you can say that Keld has defined a 10646 character set representation using only a limited 7-bit character set.

>My opinion is that (3) is more of a long-term goal - for 95 % of users of
>Usenet, (2) is all that is needed.

And if you want to keep it that way, sure, limit yourself to (2).

>But what changes would (2) require ?
>Change #1: Any ASCII computer on Usenet must accept 8-bit news and E-mail,
>           and be able to forward articles without changes (in other words -
>           don't strip the eighth bit !!!) This is the only change required
>           from the "English-only" ASCII-sites, where no 8-bit articles
>           would originate or be read.

The "only" change, yes, but a change which you simply cannot expect to be done. No hope at all!

>Change #2: Any computer on Usenet using an extended version of ASCII (CP 437,
>           ISO 8859/x etc) must translate all postings to one of the 8859/x
>           character sets and indicate (in the header) which one is used.

This wouldn't do it if the recipient end cannot handle that character set. Or, put another way: which one of the 8859/x character sets should you use? 8859/x is probably *the* answer for use within a certain country on *most* UNIX boxes, but what about all the PC character sets, EBCDIC hosts etc.? Don't you think a little more than 5% of the users are in that category?

>Change #3: Any computer not using ASCII, but rather EBCDIC (or something else),
>           must translate all postings to one of the 8859/x character sets,
>           instead of just translating to ASCII.

If they have to translate, they can just as well translate into something which has a good chance of getting through the network - and 8859 doesn't have a chance there.

>Change #4: Any computer must accept postings in one of the 8859/x character
>           sets and be able to translate them to the character set used
>           by each user.

But what if I support 8859/1 and get an article written in 8859/7 (Greek?) If we use your scheme, *all* the 8859/x sets must be accepted!

>Problem #1: If the local character set is not able to represent all the
>            characters in the original posting, they must be represented as
>            well as possible. For example - a 7-bit computer receiving a text
>            containing accented vowels might be expected just to drop the
>            accent marks.

Which may in some cases completely change the meaning!

>Problem #2: Different users - even on the same machine - have different
>            capabilities to display 8-bit text. For example, in Scandinavia
>            it is common for terminals to use a 7-bit character set, where
>            some of the characters (for example { [ ] } |) have been replaced
>            by non-ASCII characters. Other users in the same countries have
>            fully 8-bit terminals (for example PCs running a terminal
>            emulator). The computer must store incoming articles as they
>            arrive and the news/E-mail software must be updated to display
>            them according to the capabilities of each terminal, as indicated
>            by an environment variable.

Exactly, and that is definitely easiest if everybody agrees on *one* common "carrier character set" (my suggested term for such a character set).

>So - what now ?
>Is there any interest in creating a "working group" to attack the problem ?
>Any of the authors of rn, nn, elm or other news/e-mail software out there ?

Yes, support for Keld's multi character set handling is planned for an upgrade to nn 6.4 later this year. We have been looking at what can be used as the escape character in news, and this is definitely a problem, since inews traditionally is very restrictive with respect to what it will pass through (^] is filtered out, as are most other control characters).

But we believe we have found the right solution, which will pass through at least Bnews' inews, and is supposed to be *transparent* to most news interfaces: We use a double escape character consisting of a "space" followed by a "backspace". When output to a screen this will be invisible, and most pagers will handle backspace properly (i.e. move the cursor back over the space). And we think it is very unlikely that this sequence will occur in normal postings (we see no purpose for it). And since only articles which have the proper header specifying that this is really an encoded article will be "decoded", the filters which encode the articles at the originating end can check that no such sequences exist in the original text.

>We are of course willing to share our modifications to the programs, and with
>a bit of work we should be able to have 8-bit news/email running in a few
>months.

nn users world-wide can soon exchange multi character set news - other users can read it (without problems), and we will publish our code and specifications so that other interfaces can support it as well.

>So - any volunteers ?

Yes, but is there any interest in what we plan to do??? And will our "space-backspace" escape pass through Cnews, NNTP and other inews/relaynews/whatever implementations (without modification)?

-- 
Kim F. Storm <storm@texas.dk>        No news is good news,
Texas Instruments A/S, Denmark       but nn is better!
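
The "space-backspace" escape and the two safeguards described in the posting above (decode only when a header marks the article as encoded, and have the encoding filter reject text that already contains the sequence) could be sketched in Python roughly as follows. The header name X-Encoded-Charset, the tiny name table and the function names are hypothetical, chosen only for illustration; the posting does not specify them.

```python
ESC = " \b"  # space followed by backspace, the proposed double escape

# Hypothetical two-entry name table; both names appear earlier in the thread.
NAMES = {"é": "e'", "ö": "o:"}
CHARS = {name: ch for ch, name in NAMES.items()}

ENCODED_HEADER = "X-Encoded-Charset"  # assumed header name, not from the post


def encode_body(body):
    """Encode named characters; refuse text that already contains the escape."""
    if ESC in body:
        raise ValueError("original text already contains space-backspace")
    # Characters without a name pass through unchanged in this sketch.
    return "".join(ESC + NAMES[ch] if ch in NAMES else ch for ch in body)


def decode_article(headers, body):
    """Decode only articles whose headers declare them as encoded."""
    if ENCODED_HEADER not in headers:
        return body
    parts = body.split(ESC)
    out = [parts[0]]
    for part in parts[1:]:
        if part[:2] in CHARS:
            out.append(CHARS[part[:2]] + part[2:])
        else:
            out.append(ESC + part)  # unknown name: leave untouched
    return "".join(out)
```

A reader that knows nothing about the scheme prints the body unchanged; on a terminal that honours backspace, the escape itself does not show.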
ed@braaten.doit.sub.org (Ed Braaten) (08/05/90)
VERKADE@CTSS.CO.UK (Herman Verkade) writes:

>A couple of comments on 8 bit news. It seems to me that it is not necessary to
>convert the whole net to 8 bit. The 7 bit restriction is only a problem for
>specific newsgroups: newsgroups in languages other than English and newsgroups
>containing binary data, such as bitmaps, .gif files, etc. So, I don't think
>**everybody** needs to upgrade to some implementation that supports 8 bits.
>Only those that wish to carry newsgroups, that need it. All we would need
>is a standard, not necessarily a world-wide upgrade of software.

I think this is the right approach to the problem. If it works, don't fix it! ;-) But give the non-English and binary people a chance. A standard, however, is an absolute must.

>My proposal would be RFC-1154-style, because it also allows one message to
>contain encodings in different parts and could therefore also be used to
>automatically convert different parts of a message in 7 bit groups. For
>example, a message containing a uuencoded file preceded by some explanation
>in ASCII and a signature at the bottom, could have a header such as:
>  Encoding: 10 text, 1045 uuencode, 5 text
>A smart news reader could display the two text parts and ask whether you want
>the uuencode bit to be uudecoded. For an article containing a header like:
>  Encoding: 15 text, 637 uugif, 5 text
>the reader could then automatically extract the uuencoded .gif file and
>display an image instead. Etc, etc, etc. And only users that want such
>functionality switch to a news reader that supports it.

How about it? Could we get the author of nn sold on this? (I'm crossposting this article to n.s.nn to find out...)

>I realise that I am discussing two separate topics here:
>1) Provide 8 bit transport mechanisms so that international character sets
>   can be used, but enable 8 bits only on a newsgroup by newsgroup basis
>   with either a designated character set for such a group, or an Encoding
>   header to indicate the character set.
>2) An Encoding: header for carrying data other than text (in either 7 or
>   8 bit groups).

I like your suggestions, Herman. What about the rest of the net? Opinions? Comments?

Greetings from Munich,
Ed

---------------------------------------------------------------------------
Ed Braaten                     | Jesus answered, "I am the way and the
Work: ed@imuse.de.intel.com    | truth and the life. No one comes to the
Home: ed@braaten.doit.sub.org  | Father except through me." John 14:6
---------------------------------------------------------------------------
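
A news reader could act on the Encoding: header from the quoted proposal roughly as in the Python sketch below. It assumes, as the quoted examples suggest, that each number is a count of body lines and each keyword names how that part is encoded; the function names are hypothetical and nothing here is taken from an existing reader.

```python
def parse_encoding(value):
    """Parse '10 text, 1045 uuencode, 5 text' into [(10, 'text'), ...]."""
    parts = []
    for item in value.split(","):
        count, kind = item.split()
        parts.append((int(count), kind))
    return parts


def split_body(lines, value):
    """Cut the article body into (kind, lines) sections, in header order."""
    sections = []
    pos = 0
    for count, kind in parse_encoding(value):
        sections.append((kind, lines[pos:pos + count]))
        pos += count
    return sections
```

With the sections separated this way, a reader could display the text parts directly and offer to decode the uuencode part only on request.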
mcmahon@tgv.com (John McMahon) (08/06/90)
In article <3863a@braaten.doit.sub.org>, ed@braaten.doit.sub.org (Ed Braaten) writes...

>>My proposal would be RFC-1154-style, because it also allows one message to
>>contain encodings in different parts and could therefore also be used to
>>automatically convert different parts of a message in 7 bit groups. For
>>example, a message containing a uuencoded file preceded by some explanation
>>in ASCII and a signature at the bottom, could have a header such as:
>
>>  Encoding: 10 text, 1045 uuencode, 5 text

My understanding is that an RFC is in the works for "non-textual transmission of data via E-mail". I suspect this could be easily expanded to include USENET NEWS.

Watch the NIC for announcements of new RFCs...

John 'Fast-Eddie' McMahon    :    MCMAHON@TGV.COM    : TTTTTTTTTTTTTTTTTTTTTTTT
TGV, Incorporated            :                       :    T   GGGGGGG  V     V
603 Mission Street           : HAVK (abha) Gur bayl  :    T  G          V   V
Santa Cruz, California 95060 : bcrengvat flfgrz gb   :    T  G    GGGG   V V
408-427-4366 or 800-TGV-3440 : or qrfgeblrq ol znvy  :    T   GGGGGGG     V