[news.software.nn] Time for 8 bit news, isn't it?????.

storm@texas.dk (Kim F. Storm) (07/24/90)

(I've cross-posted this to news.software.nn since it contains some information
about future directions of nn.  Followups are directed to .b only.  ++Kim).

In news.software.b, frisk@rhi.hi.is (Fridrik Skulason) writes:

>Well - some of us have 8-bits news already - I am for example using an 8-bit
>'rn' right now.

nn release 6.4 supports presentation of 8-bit data (and with pl 6 it
also accepts 8-bit command input).

>I fully agree that we need an 8-bit news system (as well as 8-bit E-mail),
>as this would make life a lot easier for those of us not using English.

Certainly depends on who "we" are.  I think that if we are going to attack
this problem, we better do it properly from the start, and not *just* solve
the 8-bit news problem (which you cannot solve anyway).

>Modifying the news software to permit the transmission of 8-bit data is
>trivial - the real problem is the character set issue.

Changing any software is just *so easy*.  But it is *impossible* to get
people to install the changes unless they have a personal interest in
doing so.

I speak from experience:  More than a year has passed since the initial
worldwide release of nn (rel. 6.3.0).  Since then, about 20 patches,
including a new release, have been posted, but there are still some sites
out there running 6.3.0 (or .1 or .2), which you can recognize from the
RFC-violating Re^2: prefixes in the Subject: lines of some postings.

I still get complaints about how stupid nn is, although this problem is
fixed (oh yes, it was *trivial*).  But getting people to update....

So if you want this to work on a world-wide scale within a timeframe of
less than 3-6 years, I believe it must be done in the news reader software
at the end-points which need this, and we must design a transport protocol
for 8-bit, 16-bit and even 32-bit data which can:

a) be transparently sent through the current 7-bit restrictive channels,
   and

b) still be interpreted sensibly on systems and by news readers which
   are not adapted to this new scheme.

Keld Simonsen from the Danish UNIX-system Users Group (DKUUG) has
defined a new naming scheme for international characters based on 10646,
using primarily two-character names which attempt to be as close to
the real character as possible.  For example, e' is an e with ' above it,
o: is an o with two dots above it, etc.

Now this sounds rather trivial, but it has specifically been designed
for the above purposes:  (a) it uses only a subset of the ASCII character
set (e.g. { and } are not used since they are used for national characters
in many older 7-bit character sets based on ISO 646), and (b) as the example
shows, the character name is a close approximation to the actual character.
So you can actually read a letter written using the character names!
(Of course, "a" is named "a", "b" is "b", "A" is "A", etc.).

A letter can then be written in any of the 8859/x variants, in various
EBCDIC and other IBM codepages, etc. using N-bit codes supported on the
local system.  However, when such a letter is transmitted to a remote
system, all the international characters are *encoded*: each one is
replaced by an "escape character" followed by its (two-character)
character name.  The result is a pure 7-bit letter.

At the receiving end, the encoded letter can be converted back
to the originating character set, or to the character set used on the
local system, as far as that is possible.  But that is the choice of
the receiving end!  Or, it can just be read without conversion since
the encoding is "readable" (the only problem being the escape character).
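
The encode/decode round trip described above can be sketched roughly as
follows.  This is only an illustration in Python, not Keld's sendmail code:
the table holds just the two names given in the text (e' and o:) plus two
entries assumed to follow the same pattern, and the function names are
made up for the example.

```python
ESC = "\x1d"  # ^] (ASCII GS), the default escape character mentioned in the text

# Illustrative subset only: "e'" and "o:" come from the text; "a:" and "u:"
# are assumed to follow the same pattern.
NAMES = {"é": "e'", "ö": "o:", "ä": "a:", "ü": "u:"}
CHARS = {name: ch for ch, name in NAMES.items()}


def encode(text, esc=ESC):
    """Replace each non-ASCII character by esc + its two-character name."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.append(esc + NAMES[ch])  # KeyError if the name is not in the table
    return "".join(out)


def decode(text, esc=ESC, supported=CHARS):
    """Turn esc + name back into a character where the local set has it."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == esc and i + 3 <= len(text):
            name = text[i + 1:i + 3]
            # Names the local set cannot show stay readable as plain names.
            out.append(supported.get(name, name))
            i += 3
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

A recipient whose character set lacks a given character can pass a smaller
`supported` table and still gets the readable name in its place.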

In the sendmail used on the Danish DKnet backbone, Keld has implemented
this and it is running very well, supporting about 50 different
character sets.  By default it uses ^] as the escape character which has the
benefit of being invisible on most terminals, but it can use any escape
character you like.  Both the escape character and the originating
character set are specified in the article's header.  (more on this below)

>Some possible solutions:

>(1)  Each machine posts articles using the user's character set of choice.
>To indicate which character set is used, a new field is added to the header.

>                 examples:     Character-set: CP 870
>	                       Character-set: ISO 8859/4

This is what Keld's sendmail extensions support today.

>This is easy to implement, but has one serious drawback - all machines are
>required to be able to handle all possible character sets.

Not with Keld's solution:
- If you know the character set, you can convert to it.

- If you use another character set, you can convert to that instead since
all international characters have been given *unique* names.

- If your software doesn't understand any of it, you can still read the
message with little or no problems.

And in the sendmail case, the Danish backbone is actually doing the
encoding *and* decoding for the Danish sites for which it has been told
which character set they prefer.  So if one site runs 8859/1, they send
8-bit 8859/1 data directly to the backbone, and if the recipient is a known
EBCDIC site, the backbone converts the letter to EBCDIC before delivery!
If it is to an unknown site, it will be converted to the "encoded" character
set, and it is thus the task of the recipient to handle it.
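
As a rough sketch of the backbone behaviour just described (not the actual
DKnet sendmail configuration), the routing decision might look like this in
Python.  The site names, the SITE_CHARSET table, the "encoded" label and the
function names are all hypothetical, and cp500 merely stands in for "some
EBCDIC code page".

```python
# Hypothetical table of sites and the character set each has registered.
SITE_CHARSET = {
    "unix.example.dk": "iso8859-1",   # 8-bit ISO 8859/1 site
    "ibm.example.dk": "cp500",        # an EBCDIC code page
}

ESC = "\x1d"                          # ^], the default escape character
NAMES = {"é": "e'", "ö": "o:"}        # tiny illustrative subset of the names


def to_carrier(text):
    """Encode text as pure 7-bit data: ESC + two-character name."""
    return "".join(c if ord(c) < 128 else ESC + NAMES[c] for c in text)


def relay(data, sender, recipient):
    """Return (charset label, bytes) as the backbone would deliver them."""
    text = data.decode(SITE_CHARSET[sender])
    target = SITE_CHARSET.get(recipient)
    if target is not None:
        # Known recipient: convert directly to its preferred character set.
        return target, text.encode(target)
    # Unknown recipient: send the 7-bit encoded form and let it decide.
    return "encoded", to_carrier(text).encode("ascii")
```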

So in Denmark we not only run 8-bit mail, but *multi character set* mail.
And it is transparent for all practical uses.

>(2)  On every machine the article is translated into one of the ISO 8859/x
>character sets....

Too limited, and which one should you choose?

>(3)  All text is transmitted according to the ISO 10646 standard.  This has
>one advantage compared to (2) - it allows the transmission of documents
>containing 16-bit characters, as well as documents containing characters from
>more than one of the 8859/x standards.  For example, one could send a message
>with the first part in Russian and the second part in Greek.

Currently, I think Keld has defined about 1000 characters *including* Greek,
Russian (Cyrillic), Hebrew, Arabic, all 8859 sets, EBCDIC, PC character sets
and more.  And there are "hooks" reserved to include longer names for
kanji characters and the like.  So you can say that Keld has defined a
10646 character set representation using only a limited 7-bit character set.

>My opinion is that (3) is more of a long-term goal - for 95 % of users of
>Usenet, (2) is all that is needed.

And if you want to keep it that way, sure limit yourself to (2).

>But what changes would (2) require ?

>Change #1:  Any ASCII computer on Usenet must accept 8-bit news and E-mail,
>            and be able to forward articles without changes (in other words -
>            don't strip the eighth bit !!!)  This is the only change required
>            from the "English-only" ASCII-sites, where no 8-bit articles
>            would originate or be read.

The "only" change, yes, but a change which you simply cannot expect to be done.
No hope at all!

>Change #2:  Any computer on Usenet using an extended version of ASCII (CP 437,
>            ISO 8859/x etc) must translate all postings to one of the 8859/x
>            character sets and indicate (in the header) which one is used.

This wouldn't do it if the recipient end cannot handle that character set.
Or said in another way: which one of the 8859/x character sets should you
use?  8859/x is probably *the* answer for use within a certain country on
*most* UNIX boxes, but what about all the PC character sets, EBCDIC hosts
etc.  Don't you think a little more than 5% of the users are in that
category?

>Change #3:  Any computer not using ASCII, but rather EBCDIC (or something else),
>            must translate all postings to one of the 8859/x character sets,
>            instead of just translating to ASCII.

If they have to translate, they can just as well translate into something
which has a good chance of getting through the network - and 8859 doesn't
have a chance there.

>Change #4:  Any computer must accept postings in one of the 8859/x character
>            sets and be able to translate them to the character set used
>            by each user.

But what if I support 8859/1 and get an article written in 8859/7 (Greek)?
If we use your scheme, *all* the 8859/x sets must be accepted!

>Problem #1: If the local character set is not able to represent all the
>            characters in the original posting, they must be represented as
>            well as possible.  For example - a 7-bit computer receiving a text
>            containing accented vowels might be expected just to drop the
>            accent marks.

Which may in some cases completely change the meaning!

>Problem #2: Different users - even on the same machine - have different
>            capabilities to display 8-bit text.  For example, in Scandinavia
>            it is common for terminals to use a 7-bit character set, where
>            some of the characters (for example { [ ] } |) have been replaced
>            by non-ASCII characters.  Other users in the same countries have
>            fully 8-bit terminals (for example PCs running a terminal
>            emulator).  The computer must store incoming articles as they
>            arrive and the news/E-mail software must be updated to display
>            them according to the capabilities of each terminal, as indicated
>            by an environment variable.

Exactly, and that is definitely easiest if everybody agrees on *one*
common "carrier character set" (my suggested term for such a character set).

>So - what now ?

>Is there any interest in creating a "working group" to attack the problem ?
>Any of the authors of rn, nn, elm or other news/e-mail software out there ?

Yes, support for Keld's multi character set handling is planned for an
upgrade to nn 6.4 later this year.  We have been looking at what can be
used as the escape character in news, and this is definitely a problem,
since inews traditionally is very restrictive with respect to what it
will pass through (^] is filtered out, like most other control characters).

But we believe we have found the right solution, which will pass
through at least Bnews' inews, and is supposed to be *transparent* to
most news interfaces:  We use a double escape character consisting of
a "space" followed by a "backspace".  When output to a screen this will be
invisible and most pagers will handle backspace properly (i.e. move the
cursor back over the space).  And we think it is very unlikely that this
sequence will occur in normal postings (we see no purpose for it).

And since only articles which have the proper header specifying that this
is really an encoded article will be "decoded", the filters which encode
the articles at the originating end can check that no such sequences exist
in the original text.
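
To make the space-backspace idea concrete, here is a small Python sketch of
an encoding filter and its counterpart.  It is only an illustration under
assumptions: the header name "Character-set" is borrowed from the quoted
proposal rather than from any published specification, the name table is a
two-entry sample, and the function names are invented.

```python
ESC = " \b"                      # "space" followed by "backspace"
NAMES = {"é": "e'", "ö": "o:"}   # illustrative two-entry name table
HEADER = "Character-set"         # assumed header name; not specified in the text


def encode_article(headers, body, charset):
    """Encode body to 7-bit text and mark the article as encoded."""
    if ESC in body:
        # The filter at the originating end refuses ambiguous input.
        raise ValueError("body already contains a space-backspace sequence")
    encoded = "".join(c if ord(c) < 128 else ESC + NAMES[c] for c in body)
    marked = dict(headers)
    marked[HEADER] = charset
    return marked, encoded


def decode_article(headers, body):
    """Decode only articles that carry the marking header."""
    if HEADER not in headers:
        return body              # not marked as encoded: leave untouched
    chars = {name: ch for ch, name in NAMES.items()}
    out = []
    i = 0
    while i < len(body):
        name = body[i + 2:i + 4]
        if body.startswith(ESC, i) and name in chars:
            out.append(chars[name])
            i += 4
        else:
            out.append(body[i])
            i += 1
    return "".join(out)
```

Because decoding is keyed on the header, an unmarked article that happens to
contain a space-backspace pair passes through unchanged.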

>We are of course willing to share our modifications to the programs, and with
>a bit of work we should be able to have 8-bit news/email running in a few
>months.

nn users world-wide can soon exchange multi character set news - other
users can read it (without problems), and we will publish our code and
specifications so that other interfaces can support it as well.

>So - any volunteers ?

Yes, but is there any interest in what we plan to do???

And will our "space-backspace" escape pass through Cnews, NNTP and
other inews/relaynews/whatever implementations (without modification)?

-- 
Kim F. Storm  <storm@texas.dk>		No news is good news,
Texas Instruments A/S, Denmark		  but nn is better!

ed@braaten.doit.sub.org (Ed Braaten) (08/05/90)

VERKADE@CTSS.CO.UK (Herman Verkade) writes:

>A couple of comments on 8 bit news. It seems to me that it is not necessary to
>convert the whole net to 8 bit. The 7 bit restriction is only a problem for
>specific newsgroups: newsgroups in languages other than English and newsgroups
>containing binary data, such as bitmaps, .gif files, etc. So, I don't think
>**everybody** needs to upgrade to some implementation that supports 8 bits.
>Only those that wish to carry newsgroups that need it. All we would need
>is a standard, not necessarily a world-wide upgrade of software.

I think this is the right approach to the problem.  If it works, don't
fix it! ;-)  But give the non-English and binary people a chance.  A
standard, however, is an absolute must.

>My proposal would be RFC-1154-style, because it also allows one message to
>contain encodings in different parts and could therefore also be used to
>automatically convert different parts of a message in 7 bit groups. For
>example, a message containing a uuencoded file preceded by some explanation
>in ASCII and a signature at the bottom, could have a header such as:

>    Encoding: 10 text, 1045 uuencode, 5 text

>A smart news reader could display the two text parts and ask whether you want
>the uuencode bit to be uudecoded. For an article containing a header like:

>    Encoding: 15 text, 637 uugif, 5 text

>the reader could then automatically extract the uuencoded .gif file and
>display an image instead. Etc, etc, etc. And only users that want such
>functionality switch to a news reader that supports it.
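
A news reader could act on such a header roughly as in the Python sketch
below.  It assumes each subfield is "<line count> <keyword>", as in the
examples above; the function names are invented for the illustration and
the full RFC 1154 syntax is not handled.

```python
def parse_encoding(value):
    """Parse '10 text, 1045 uuencode, 5 text' into (count, keyword) pairs."""
    parts = []
    for field in value.split(","):
        tokens = field.split()
        parts.append((int(tokens[0]), tokens[1].lower()))
    return parts


def split_body(lines, parts):
    """Cut the body lines into (keyword, lines) chunks in header order."""
    chunks = []
    pos = 0
    for count, keyword in parts:
        chunks.append((keyword, lines[pos:pos + count]))
        pos += count
    return chunks
```

The reader would then display the "text" chunks directly and offer to
uudecode (or hand to an image viewer) the others.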

How about it?  Could we get the author of nn sold on this?  (I'm
crossposting this article to n.s.nn to find out...)

>I realise that I am discussing two separate topics here:
>1) Provide 8 bit transport mechanisms so that international character sets
>   can be used, but enable 8 bits only on a newsgroup by newsgroup basis
>   with either a designated character set for such a group, or an Encoding
>   header to indicate the character set.
>2) An Encoding: header for carrying data other than text (in either 7 or
>   8 bit groups).

I like your suggestions, Herman.  What about the rest of the net?
Opinions?  Comments?

Greetings from Munich,

Ed

---------------------------------------------------------------------------
        Ed Braaten             |  Jesus answered,  "I am the way and the
Work: ed@imuse.de.intel.com    |  truth and the life.  No one comes to the
Home: ed@braaten.doit.sub.org  |  Father except through me."   John 14:6 
---------------------------------------------------------------------------

mcmahon@tgv.com (John McMahon) (08/06/90)

In article <3863a@braaten.doit.sub.org>, ed@braaten.doit.sub.org (Ed Braaten) writes...
>>My proposal would be RFC-1154-style, because it also allows one message to
>>contain encodings in different parts and could therefore also be used to
>>automatically convert different parts of a message in 7 bit groups. For
>>example, a message containing a uuencoded file preceded by some explanation
>>in ASCII and a signature at the bottom, could have a header such as:
> 
>>    Encoding: 10 text, 1045 uuencode, 5 text

My understanding is that an RFC is in the works for "non-textual transmission
of data via E-mail".  I suspect this could be easily expanded to include
USENET NEWS.

Watch the NIC for announcements of new RFCs...

John 'Fast-Eddie' McMahon    :    MCMAHON@TGV.COM    : TTTTTTTTTTTTTTTTTTTTTTTT
TGV, Incorporated            :                       :    T   GGGGGGG  V     V
603 Mission Street           : HAVK (abha) Gur bayl  :    T  G          V   V
Santa Cruz, California 95060 : bcrengvat flfgrz gb   :    T  G    GGGG   V V
408-427-4366 or 800-TGV-3440 : or qrfgeblrq ol znvy  :    T   GGGGGGG     V