[news.software.b] New USENET header: Language

brad@looking.on.ca (Brad Templeton) (12/22/90)

I propose a new USENET header item, namely "Language:"

The default, for historical reasons, would be "Language: English" but
other fields would be fine.   And sorry to be so anglo-centric, but
I suspect that the language names should be the English names for the
languages, since it is consistent with the use of English header names.

i.e.

Language: French

not: "Language: Francais" or "Langue: Francais"

The need for this grows clear in an international net.  People will want
to use more and more languages, and those who can't read them will want
to conveniently skip such postings.   Indeed, some sites may not even want
to feed such postings.

Bilingual postings, as are often found in can.general, (where a person
posts in English and the followup, with included text, is in French) could
have "Language: English, French" although that doesn't fly well with the
current kill file mechanism.  (It works fine with something like newsclip)

I would encourage that we avoid cuteness like "Language: C" for source
postings.   We really need another header to distinguish between postings
for human consumption and computer consumption.  Besides, C programs still
need a language header to mark the language of their variables and comments.
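
As a rough illustration of how a reader or kill-file-style filter might act
on such a header, here is a minimal Python sketch.  The function names and
the policy of skipping an article only when none of its listed languages is
readable are assumptions made for this example; they are not taken from any
existing news software.

```python
def parse_languages(headers):
    """Return the languages named in a Language: header.

    A missing header falls back to English, the default proposed above.
    """
    value = headers.get("Language", "English")
    return [name.strip() for name in value.split(",") if name.strip()]


def should_skip(headers, readable):
    """Assumed policy: skip only if the reader knows none of the languages."""
    return not any(lang in readable for lang in parse_languages(headers))


bilingual = {"Subject": "Re: hockey", "Language": "English, French"}
print(parse_languages(bilingual))                 # ['English', 'French']
print(should_skip(bilingual, {"French"}))         # False
print(should_skip({"Subject": "x"}, {"French"}))  # True: defaults to English
```

Under this policy a bilingual posting stays visible to anyone who reads
either language, and an article with no header is treated as English.
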
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

rees@pisa.ifs.umich.edu (Jim Rees) (12/23/90)

In article <1990Dec22.081718.2109@looking.on.ca>, brad@looking.on.ca (Brad Templeton) writes:

  I propose a new USENET header item, namely "Language:"
  
  The default, for historical reasons, would be "Language: English" but
  other fields would be fine.   And sorry to be so anglo-centric, but
  I suspect that the language names should be the English names for the
  languages, since it is consistent with the use of English header names.

I like this a lot, but it's really just the tip of the iceberg.  We need a
way to feed non-latin scripts through the net.

Way back in 1983 I posted some patches to make B news 8-bit safe.  Almost
all new news transport is now done either by nntp or uucp, both of which are
8-bit safe, so it's just a matter of making sure your relay program doesn't
strip bits.  Are modern B-news and C-news 8-bit safe?

The other half of this is fixing the reading and posting programs.  For
example, xrn could be fixed so that if it sees "Language: Japanese" in the
header, it switches to a text widget that can display Japanese text.  This
would require agreement on which of the Japanese text encoding schemes to
use (EUC, JIS, etc.) but this shouldn't be hard if we discuss it here and
just pick one.  There are already text widgets to display Japanese so
hooking one up to xrn shouldn't be hard.

Right now there is a debate going on in soc.culture.lebanon on the best way
to write Arabic using a latin script.  That misses the boat, as I see it.
What we should be doing is sending and displaying actual Arabic characters,
not some romanised bastardisation.  This has already happened in
soc.culture.vietnamese, where postings are full of diacriticals following the
letters they should go over: "da^'u na(.ng" is a poor substitute for real,
written Vietnamese.

I'll shut up now since I'm not volunteering to do any real work.

nmouawad@watmath.waterloo.edu (Naji Mouawad) (12/23/90)

In article <4ec122f7.1bc5b@pisa.ifs.umich.edu> rees@citi.umich.edu
(Jim Rees) writes:

   We need a way to feed non-latin scripts through the net.  Way
   back in 1983 I posted some patches to make B news 8-bit safe.
   Almost all new news transport is now done either by nntp or uucp,
   both of which are 8-bit safe, so it's just a matter of making
   sure your relay program doesn't strip bits.  Are modern B-news
   and C-news 8-bit safe?

In fact, for Arabic 7 bits suffice, since Arabic has 28 letters,
each of which can be written in at most 3 different ways, which makes
a total of 84 combinations, well below the 128 characters of a 7-bit
representation.


  The other half of this is fixing the reading and posting
  programs.  For example, xrn could be fixed so that if it sees
  "Language: Japanese" in the header, it switches to a text widget
  that can display Japanese text.  This would require agreement on
  which of the Japanese text encoding schemes to use (EUC, JIS,
  etc.) but this shouldn't be hard if we discuss it here and just
  pick one.  There are already text widgets to display Japanese so
  hooking one up to xrn shouldn't be hard.


This seems to be a good starting point. I've heard that GNU Emacs can
be tweaked to support right-to-left editing. If this is the case, and
assuming there is a way to swap fonts on an ASCII terminal (such as
a VT100, Wyse or whatever) to replace the Latin alphabet with the Arabic
alphabet, then we should be all set.

We cannot at present require an X Window terminal to read
news.  As a rule, any terminal that gives access to the news in
English should just as easily give access to the news in a
different language.


    Right now there is a debate going on in soc.culture.lebanon on
    the best way to write Arabic using a latin script.  That misses
    the boat, as I see it.  What we should be doing is sending and
    displaying actual Arabic characters, not some romanised
    bastardisation.  

That would be really nice!  I am more than willing to stop looking for
a standard way of writing Arabic in the Latin alphabet if we can write
using the Arabic alphabet.

Any suggestions are welcome.

   I'll shut up now since I'm not volunteering to do any real work.

Thanks for your comments.

--Naji.
-- 
     -------------------------------------------------------------------
    | Naji Mouawad  |          nmouawad@watmath.waterloo.edu            |
    |  University   |---------------------------------------------------|
    | Of Waterloo   |   "The Stranger in us is our most familiar Self"  |

brad@looking.on.ca (Brad Templeton) (12/23/90)

It is my understanding that 8 (or more) bit representations of the non-roman
character set are not fully standardized.  Is this wrong?

In such an event, you need two headers -- one for the natural language, and
another for the encoding format.

Note that encoding format can be a broad format, as it is really for
computer consumption.  Valid forms could be ASCII (with underlining) -- the
default, various extended-Ascii formats, Katanji etc. but also binary,
image, rich text etc.
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

spike@world.std.com (Joe Ilacqua) (12/24/90)

In article <1990Dec22.081718.2109@looking.on.ca> brad@looking.on.ca (Brad Templeton) writes:

Xref: world trial.soc.culture.italian:13 news.software.b:3164
Path: world!uunet!looking!brad
From: brad@looking.on.ca (Brad Templeton)
Newsgroups: trial.soc.culture.italian,news.software.b
Date: 22 Dec 90 08:17:18 GMT
Followup-To: news.software.b
Organization: Looking Glass Software Ltd.
Lines: 29

<I propose a new USENET header item, namely "Language:"
<The default, for historical reasons, would be "Language: English" but
<other fields would be fine.

	Something is definitely needed.  I already have German, French, and
Japanese hierarchies on my system, and I am sure there are more
coming.  (Anyone want to write "Japanese for News Administrators"?  I
could use it.)  I think we need something more than just "Language:",
maybe "Encoding:".  For example:

Encoding: ASCII
Encoding: JIS
Encoding: ISO-(mumble)

	I would think that ASCII would be the default.  And of course,
as we move to multimedia news, "Encoding:" would also be useful:

Encoding: GIF89
Encoding: FACESAVER
Encoding: RFC-(mumble)

	With "Language:" I think you might want a region, as in
"American English" or "Australian English" or even "California
English".  That way your news "reader" could have the right accent ;-)

->Spike

-- 
The World - Public Access Unix - +1 617-739-9753  24hrs {3,12,24,96,192}00bps

zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) (12/24/90)

Putting the character set in the header limits us to only one language per
article.  I'd much prefer to see some escape mechanism in the body.
This would allow things like an English posting with a proper xxx
language name in the signature.

Putting the language (vs the char set) in the header is fine with me.

NNTP, C News (except for some utilities on some systems) and trn
all seem to support 8 bit character sets.

-- 
Jon Zeeff (NIC handle JZ)	 zeeff@b-tech.ann-arbor.mi.us

john@george.Jpl.Nasa.Gov (Hung P. Ho Jr.) (12/25/90)

In article <4ec122f7.1bc5b@pisa.ifs.umich.edu> rees@citi.umich.edu (Jim Rees) writes:
>In article <1990Dec22.081718.2109@looking.on.ca>, brad@looking.on.ca (Brad Templeton) writes:
>Right now there is a debate going on in soc.culture.lebanon on the best way
>to write Arabic using a latin script.  That misses the boat, as I see it.
>What we should be doing is sending and displaying actual Arabic characters,
>not some romanised bastardisation.  This has already happened in
>soc.culture.vietnamese, where postings are full of diacriticals following the
>letters they should go over: "da^'u na(.ng" is a poor substitute for real,
>written Vietnamese.

Yes, I agree that the marks next to the letters are not that easy to read
or understand, but it is the best way we have come up with.  Using
this scheme, if you have a workstation running X or another window system,
there are preprocessors that can convert these marks into real
Vietnamese characters.  For hard copy, there are LaTeX/nroff processors.
On the other hand, a large percentage of users are still
using dumb terminals, for which this scheme will still work.

Your proposal for an automatic language processor is a good idea,
but I think the project would be too big to handle.  Based on the
new ISO work, we will see a standard for special characters
very soon.

Let us know if you succeed.  Good luck.

                    ._. ._____   ._.		
IAS JETSUN Net      | | | .__ \  | |		JOHN HO, System Administrator
                    | | | |__) | | |		   John@elroy.jpl.nasa.gov
 NASA/CALTECH     __| | | |\__/  | |___		   ..!cit-vax!elroy!john
                /_____| |_|      |_____\	   hho@chaph.usc.edu

lyndon@cs.athabascau.ca (Lyndon Nerenberg) (12/25/90)

>  The other half of this is fixing the reading and posting
>  programs.  For example, xrn could be fixed so that if it sees
>  "Language: Japanese" in the header, it switches to a text widget
>  that can display Japanese text.  This would require agreement on
>  which of the Japanese text encoding schemes to use (EUC, JIS,
>  etc.) but this shouldn't be hard if we discuss it here and just
>  pick one.

The concepts of language and character set are completely distinct.
It's quite possible to build a one to many (or many to many) mapping
of languages into character sets. The proposed Language: header should
be used for its originally intended purpose: to indicate what language(s)
the author used in the posting. The issue of selecting character sets
and encodings should be addressed separately, and not inferred from the
Language: header.
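
The one-to-many point can be made concrete with a small Python sketch.  The
table below is illustrative and incomplete: the names are ones mentioned in
this thread, and the particular pairings and the helper function are
assumptions for the example rather than a registry of any kind.

```python
# Illustrative table only: one language can be written in several
# character sets, and one character set can serve several languages.
CHARSETS_FOR_LANGUAGE = {
    "English": ["ASCII", "ISO 8859-1"],
    "French": ["ISO 8859-1"],
    "Japanese": ["JIS", "EUC"],
}


def languages_for_charset(charset):
    """Invert the table: which languages can use this character set?"""
    return sorted(
        lang for lang, sets in CHARSETS_FOR_LANGUAGE.items() if charset in sets
    )


# "Language: Japanese" alone does not say which encoding the body uses.
print(CHARSETS_FOR_LANGUAGE["Japanese"])    # ['JIS', 'EUC']
# A character set alone does not say which language was written.
print(languages_for_charset("ISO 8859-1"))  # ['English', 'French']
```

Neither direction of the lookup gives a unique answer, which is why a
reader cannot safely infer one header from the other.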

-- 
    Lyndon Nerenberg  VE6BBM / Computing Services / Athabasca University
        {alberta,cbmvax,mips}!atha!lyndon || lyndon@cs.athabascau.ca
                    Packet: ve6bbm@ve6mc [.ab.can.na]
      The only thing open about OSF is their mouth.  --Chuck Musciano

news@m2xenix.psg.com (Randy Bush) (12/25/90)

zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) writes:

> Putting the character set in the header limits us to only one language per 
> article.
> ...
> Putting the language (vs the char set) in the header is fine with me.

Doesn't putting the language in the header limit the message to one language
almost as severely as putting the character set there?
-- 
..!{uunet,qiclab,intelhf,bucket}!m2xenix!news

rees@pisa.ifs.umich.edu (Jim Rees) (12/25/90)

In article <1990Dec23.030622.12129@looking.on.ca>, brad@looking.on.ca (Brad Templeton) writes:

  It is my understanding that 8 (or more) bit representations of the non-roman
  character set are not fully standardized.  Is this wrong?
  
  In such an event, you need two headers -- one for the natural language, and
  another for the encoding format.

Some are standardized, some are not.  I think rather than have the header
specify the encoding, we should pick one encoding for each language and make
that the standard on Usenet.  I would hope we can pick the technically
superior encoding.  For example, some European encodings substitute
non-ASCII symbols (umlauts, e.g.) for '|', '{', '}', etc.  That's a bad
idea.  We should adopt something like Latin-1 that adds the new symbols to
the old set.

Most of the encodings I've seen put ASCII in the bottom half of the 8-bit set.
This lets you mix ASCII with a foreign language without having to change
modes.
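
To illustrate the difference, here is a short Python sketch.  It uses the
German national variant of 7-bit ISO 646 (DIN 66003) as one example of a
substituting encoding; the translation table simulates how a terminal set
up for that variant would display plain ASCII, and the variable names are
made up for the example.

```python
# DIN 66003 reuses these eight ASCII code positions for German letters.
GERMAN_646 = str.maketrans("@[\\]{|}~", "§ÄÖÜäöüß")

c_source = "if (a || b) { x[0] = 1; }"
# What the same bytes look like on a terminal using the national variant:
print(c_source.translate(GERMAN_646))   # if (a öö b) ä xÄ0Ü = 1; ü

# Latin-1 instead leaves every ASCII position alone and puts the extra
# letters in the upper half of the 8-bit range.
print("{|}".encode("latin-1"))          # b'{|}'
print("üß".encode("latin-1"))           # b'\xfc\xdf'
```

With the additive scheme, source code and accented text can share one
article without either being misdisplayed.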

  Note that encoding format can be a broad format, as it is really for
  computer consumption.  Valid forms could be ASCII (with underlining) -- the
  default, various extended-Ascii formats, Katanji etc. but also binary,
  image, rich text etc.

There are all kinds of nice multi-media things we could do, but for now I
think we should keep it simple.

amanda@visix.com (Amanda Walker) (12/27/90)

I don't see a whole lot of use for a "Language:" header, but for
alternate alphabets, we could always say something like:

	Content-Type: ISO-2022

And then allow alphabet announcers and shifts in the article body.
The biggest problem would be upgrading 'rn' so that it could take
advantage of alternate character sets in dumb terminals, and map
things appropriately when possible.
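
For a feel of what announcers and shifts in the body look like, here is a
small Python sketch.  It uses Python's "iso2022_jp" codec, which implements
ISO-2022-JP, one later profile of ISO 2022; treating that profile as a
stand-in for whatever this proposal would have specified is an assumption
of the example.

```python
ESC = b"\x1b"

text = "Hello 日本語"
encoded = text.encode("iso2022_jp")
print(encoded)

# The Japanese run is bracketed by escape sequences: ESC $ B switches to
# the two-byte JIS set, ESC ( B switches back to ASCII.
print(encoded.startswith(b"Hello " + ESC + b"$B"))
print(encoded.endswith(ESC + b"(B"))

# Every byte fits in 7 bits, so the article survives a 7-bit transport.
print(all(byte < 0x80 for byte in encoded))

# A reader that understands the shifts recovers the original text.
print(encoded.decode("iso2022_jp") == text)
```

A reader on a dumb terminal would still need a policy for the parts it
cannot display, which is the upgrade problem mentioned above.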

Or we could just throw out current formats altogether and use the
X.400 message content encoding...

Usenet II, anyone?

-- 
Amanda Walker						      amanda@visix.com
Visix Software Inc.					...!uunet!visix!amanda
--
"We are holding Elvis Presley's brain hostage on Planet Zort. Surrender Now."
		--Bloom County

mikew@fx.com (Mike Wexler) (12/28/90)

zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) writes:

>Putting the character set in the header limits us to only one language per 
>article.  I'd much prefer to see some escape mechanism in the body.  
>This would allow things like an English posting with a proper xxx
>language name in the signature.
No it doesn't. There are character sets such as ISO 10646 or even to
some extent the ISO 8859-X standards that support multiple languages.
Adding the header would be easy and it would be relatively easy for an
X based newsreader to change encodings/fonts for different articles.
--
Mike Wexler (mikew@fx.com)

henry@zoo.toronto.edu (Henry Spencer) (12/28/90)

In article <4ec122f7.1bc5b@pisa.ifs.umich.edu> rees@citi.umich.edu (Jim Rees) writes:
>... Are modern B-news and C-news 8-bit safe?

We believe that C News is fully 8-bit clean, subject to two caveats:

1. Some processing of certain things is done by shell files, using various
	ordinary Unix utilities, and if they are not 8-bit clean, C News
	can't be.

2. Certain things in article headers are limited to 7 bits by the defining
	documents (RFC822 and RFC1036) even if the rest of the message uses
	a full 8 bits, and it is possible that C News depends on this in
	small ways.
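
As an illustration of what "not 8-bit clean" means in practice, here is a
minimal Python sketch of a relay step that strips the high bit.  The
function names are hypothetical and the sketch does not describe any
particular C News or Unix utility.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Simulate a utility that is not 8-bit clean: clear bit 7 of each byte."""
    return bytes(b & 0x7F for b in data)


def is_seven_bit(data: bytes) -> bool:
    """True if the data would pass through such a utility unchanged."""
    return all(b < 0x80 for b in data)


original = "café".encode("latin-1")
damaged = strip_high_bit(original)
print(original)                # b'caf\xe9'
print(damaged)                 # b'cafi'
print(is_seven_bit(original))  # False
print(is_seven_bit(b"cafe"))   # True
```

The damage is silent: the stripped article is still valid ASCII, just
wrong, which is why every utility in the path matters.
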
-- 
"The average pointer, statistically,    |Henry Spencer at U of Toronto Zoology
points somewhere in X." -Hugh Redelmeier| henry@zoo.toronto.edu   utzoo!henry

mann@intacc.uucp (Jeff Mann) (12/28/90)

In article <ZM-+Y6D@b-tech.uucp> zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) writes:
>Putting the character set in the header limits us to only one language per 
>article.  I'd much prefer to see some escape mechanism in the body.  

Yes, of course!! Brad's original example was the Canadian newsgroup where
articles often include text from one article in English, with a reply
in French. Having a Language header would restrict each article to one
specific language, and this would break down in the face of bilingual
postings. However, escape codes in the body seem to be kind of a problem
to me (writing from my adm3a terminal :-) - now what?

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
|  Jeff Mann  Inter/Access Artists' Computer Centre, Toronto  [416] 535-8601 |
| ...uunet!mnetor!intacc!mann   intacc!mann@nexus.yorku.ca  mann@intacc.uucp |
|      The Matrix Artists' Computer Network BBS: [416] 535-7598 2400 8N1     |
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) (12/28/90)

>>Putting the character set in the header limits us to only one language per 
>>article.  I'd much prefer to see some escape mechanism in the body.  
>>This would allow things like an English posting with a proper xxx
>>language name in the signature.

>No it doesn't. There are character sets such as ISO 10646 or even to
>some extent the ISO 8859-X standards that support multiple languages.
>Adding the header would be easy and it would be relatively easy for an
>X based newsreader to change encodings/fonts for different articles.

Ok, I should have said "...limits us to one char set per article".  So not 
one language, but fewer languages or at less efficiency.  Even ISO 10646
can not support language x and y for any x and y.

I don't see that specifying the char set in the header (vs the body) is
any easier to implement in a news reader.

Allowing switching of the character set within the body is going to be 
more flexible and efficient than doing it in the header.  

-- 
Jon Zeeff (NIC handle JZ)	 zeeff@b-tech.ann-arbor.mi.us

Dan@dna.lth.se (Dan Oscarsson) (12/29/90)

In article <R9A+N_=@b-tech.uucp> zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) writes:
>>>Putting the character set in the header limits us to only one language per 
>>>article.  I'd much prefer to see some escape mechanism in the body.  
>>>This would allow things like an English posting with a proper xxx
>>>language name in the signature.
>
>>No it doesn't. There are character sets such as ISO 10646 or even to
>>some extent the ISO 8859-X standards that support multiple languages.
>>Adding the header would be easy and it would be relatively easy for an
>>X based newsreader to change encodings/fonts for different articles.
>
>Ok, I should have said "...limits us to one char set per article".  So not 
>one language, but fewer languages or at less efficiency.  Even ISO 10646
>can not support language x and y for any x and y.

The only character set needed is ISO 10646. It covers nearly every character
in the world. If we use ISO 10646 as THE ONLY CHARACTER SET in articles,
everybody will always know what character set is used, and it can be used
in both headers and body. Instead of having thousands of translation tables,
one for each possible character set, we only need to know how to translate
from that one set to the local one used at a site.
Also, articles in ASCII or ISO 8859-1 can be used directly as they are,
since both are true subsets of ISO 10646.
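
The subset claim can be checked mechanically, at least against the code
points Python uses for its strings.  This sketch assumes those Unicode code
points are an acceptable stand-in for ISO 10646 as eventually published; the
helper name is made up for the example.

```python
def is_identity_subset(codec: str, size: int) -> bool:
    """True if every byte value below `size` decodes to the same code point."""
    return all(
        bytes([value]).decode(codec) == chr(value) for value in range(size)
    )


print(is_identity_subset("ascii", 128))    # True
print(is_identity_subset("latin-1", 256))  # True
```

Under that assumption, an existing ASCII or ISO 8859-1 article needs no
translation table at all: each byte value is already the right code point.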

   Dan


-- 
Dan Oscarsson                              Department of Computer Science
                                           Lund Institute of Technology
e-mail:  Dan@dna.lth.se                    Box 118
                                           S-221 00 Lund, Sweden

zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) (12/29/90)

>The only character set needed is ISO 10646. It covers nearly every character
>in the world. 

I'll agree that using just ISO 10646 would be simple and would solve 
the current problems with other language use on usenet.  Even so, 
"nearly" in the second sentence points out the potential problem with 
the first sentence.  

The efficiency issue would have to be addressed also.  You would probably
end up with some compression method (which is what an escape sequence to
switch char sets is).

-- 
Jon Zeeff (NIC handle JZ)	 zeeff@b-tech.ann-arbor.mi.us

zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) (12/29/90)

>articles often include text from one article in English, with a reply
>in French. Having a Language header would restrict each article to one
>specific language, and this would break down in the face of bilingual
>postings. 

How about allowing "Language: English/French" in the header?  

As I see it, the language header would be an advisory thing that allows
readers (or sites) to avoid articles they can't read.

>However, escape codes in the body seem to be kind of a problem
>to me (writing from my adm3a terminal :-) - now what?

This is a completely different issue (char sets).  Your system would 
do a best fit translation from the char set of the article to the char 
set of your terminal.  

-- 
Jon Zeeff (NIC handle JZ)	 zeeff@b-tech.ann-arbor.mi.us

iacovou@cs.umn.edu (Danny Iacovou) (12/30/90)

In <_2B+P*A@b-tech.uucp> zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) writes:

>>articles often include text from one article in English, with a reply
>>in French. Having a Language header would restrict each article to one
>>specific language, and this would break down in the face of bilingual
>>postings. 

>How about allowing "Language: English/French" in the header?  

   
    Taking this idea one step further, we can use language reference
    headers. We can have something like this:

        Lang References: <English> <French> <French> <English>

    I don't think that the current "References" header should be
    tampered with in order to accommodate languages.


-- 
--------------------------------------------------------------------------------
Neophytos Iacovou                                
University of Minnesota                        email:  iacovou@cs.umn.edu 
Computer Science Department                            !rutgers!umn-cs!iacovou

emv@ox.com (Ed Vielmetti) (12/30/90)

In article <1990Dec22.081718.2109@looking.on.ca> brad@looking.on.ca (Brad Templeton) writes:

   I propose a new USENET header item, namely "Language:"

   The default, for historical reasons, would be "Language: English" but
   other fields would be fine.   And sorry to be so anglo-centric, but
   I suspect that the language names should be the English names for the
   languages, since it is consistent with the use of English header names.

Apropos of nothing, I looked at Prodigy today at Sears (they had a
real interactive session live, so I poked around.)  Contributions to
Prodigy's message areas *must* be in English; apparently their censors
are all good English-speaking Americans.  Text in another language
will be rejected.

--Ed
emv@ox.com

roy@phri.nyu.edu (Roy Smith) (12/30/90)

 mann@intacc.UUCP (Jeff Mann) writes:
-> Having a Language header would restrict each article to one specific
-> language, and this would break down in the face of bilingual postings.

Why?  What's wrong with "Language: French,English".
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,cmcl2,rutgers,hombre}!phri!roy
"Arcane?  Did you say arcane?  It wouldn't be Unix if it wasn't arcane!"

Dan@dna.lth.se (Dan Oscarsson) (12/30/90)

In article <91B+H%A@b-tech.uucp> zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) writes:
>>The only character set needed is ISO 10646. It covers nearly every character
>>in the world. 
>
>I'll agree that using just ISO 10646 would be simple and would solve 
>the current problems with other language use on usenet.  Even so, 
>"nearly" in the second sentence points out the potential problem with 
>the first sentence.  
>
True, but the characters missing today are very few, and there is plenty
of space left that can be filled in with missing characters in a revision
of the standard.

>The efficiency issue would have to be addressed also.  You would probably
>end up with some compression method (which is what an escape sequence to
>switch char sets is).
ISO 10646 defines sequences for switching character length and subset of
the total set. This allows letters using ASCII or ISO 8859-1 to be sent
exactly as today, with not one character more. ISO 10646 can use 8, 16, 24 and
32 bits per character and can dynamically change between them, so the
standard defines a way to send text efficiently.
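
The idea that ASCII text need cost "not one character more" can be
illustrated with a variable-length encoding.  The sketch below uses UTF-8,
which is a different, later mechanism and not the switching sequences
described here; it is shown only because it is readily available in Python
and has the same property of using 8, 16, 24 or 32 bits per character.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 (a stand-in technique) uses for one character."""
    return len(ch.encode("utf-8"))


samples = {
    "A": "ASCII letter",
    "\u00e9": "e with acute accent, in the ISO 8859-1 range",
    "\u65e5": "a kanji",
    "\U00010348": "a character beyond the 16-bit range",
}

for ch, description in samples.items():
    print(f"U+{ord(ch):04X} {description}: {utf8_length(ch)} byte(s)")

# Pure ASCII text is byte-for-byte unchanged.
print("plain ASCII".encode("utf-8") == b"plain ASCII")
```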

   Dan
-- 
Dan Oscarsson                              Department of Computer Science
                                           Lund Institute of Technology
e-mail:  Dan@dna.lth.se                    Box 118
                                           S-221 00 Lund, Sweden

rfortier@faux (Richard W. Fortier) (12/31/90)

In article <1990Dec29.093002.10739@lth.se> Dan@dna.lth.se (Dan Oscarsson) writes:
>The only character set needed is ISO 10646. It covers nearly every character
>in the world. 

Forgive me if I miss the point of the original article; if I do it's because
I don't HAVE the original article.  But I think Dan's point leads to an
obvious conclusion; that the new header field should declare the
CHARACTER set, not the language.  This is much more general.

The only reason I can see for declaring the language in the header is 
so I can instruct my news reader I only want to read articles in Latin
(hopefully, picking a "dead" language for the example avoids offending
anyone).  That may also be a useful feature, but it is rather independent
from the issues of how to display the characters in the message.

Rich


-- 
---
Richard W. Fortier
In US:	        		|In France:
				|  Advanced Computer Research Institute 

spike@world.std.com (Joe Ilacqua) (12/31/90)

In article <ZM-+Y6D@b-tech.uucp> zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) writes:
<Putting the character set in the header limits us to only one language per 
<article.  I'd much prefer to see some escape mechanism in the body.  
<This would allow things like an English posting with a proper xxx
<language name in the signature.  

	Read my message again.  I said put the encoding method in the
header; I did not say the encoding method had to be a font.  There is no
reason "Encoding:" could not just specify a method of switching fonts.
In fact I would go as far as to say that 'Encoding: JIS' is just that:
it sez this article uses JIS escape codes to change fonts.

	As for multiple languages, list all the languages and dialects
in the "Language:" header and use a font or encoding scheme that gives
you the character sets you need.  Just because I post a message
encoded in ASCII doesn't mean you can't follow up in JIS, so long as
your posting server converts my text and sets the correct "Encoding:"
header.
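
A minimal Python sketch of that conversion step might look like the
following.  The "Encoding:" header is only the one proposed in this thread;
the function name, the table mapping header values to Python codec names,
and the use of Python's "iso2022_jp" codec for "JIS" are all assumptions
made for the example.

```python
# Hypothetical mapping from proposed header values to Python codec names.
CODECS = {"ASCII": "ascii", "JIS": "iso2022_jp"}


def transcode(headers: dict, body: bytes, target: str):
    """Re-encode an article body and set the Encoding: header to match."""
    source = headers.get("Encoding", "ASCII")   # ASCII assumed as default
    text = body.decode(CODECS[source])
    new_headers = dict(headers, Encoding=target)
    return new_headers, text.encode(CODECS[target])


quoted_headers, quoted_body = transcode(
    {"Language": "English"}, b"I propose a new header.", "JIS"
)
print(quoted_headers)  # {'Language': 'English', 'Encoding': 'JIS'}
print(quoted_body)     # b'I propose a new header.'
```

Because the ASCII text needs no escape sequences in this target encoding,
the quoted material passes through unchanged and only the header differs.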

	I want to know a lot about the form of an article and I do not
want to grovel through the text trying to figure it out.

->Spike
-- 
The World - Public Access Unix - +1 617-739-9753  24hrs {3,12,24,96,192}00bps

root@questor.wimsey.bc.ca (Postmaster) (12/31/90)

emv@ox.com (Ed Vielmetti) writes:

> In article <1990Dec22.081718.2109@looking.on.ca> brad@looking.on.ca (Brad Tem
> 
>    I propose a new USENET header item, namely "Language:"
> 
>    The default, for historical reasons, would be "Language: English" but
>    other fields would be fine.   And sorry to be so anglo-centric, but
>    I suspect that the language names should be the English names for the
>    languages, since it is consistent with the use of English header names.

I agree with the proposal.  Waffle BBS Software (which we use at this
site) already supports an 8-bit mode so that Cyrillic characters may be
used, as well as diacritical marks on latin characters, such as e-acute,
o-umlaut, and most others.

---
   Steve Pershing, System Administrator

| The QUESTOR PROJECT - Free Usenet News/Internet Mail; Sci, Med, AIDS, more |
|----------------------------------------------------------------------------|
| Usenet:  sp@questor.wimsey.bc.ca      |  POST: 1027 Davie Street,  Box 486 |
| Phones:  Voice/FAX:  +1 604 682-6659  |        Vancouver, British Columbia |
|          Data/BBS:   +1 604 681-0670  |        Canada  V6E 4L2             |

Dan@dna.lth.se (Dan Oscarsson) (12/31/90)

In article <1990Dec30.205849.1120@faux> rfortier@faux.UUCP (Richard W. Fortier) writes:
>In article <1990Dec29.093002.10739@lth.se> Dan@dna.lth.se (Dan Oscarsson) writes:
>>The only character set needed is ISO 10646. It covers nearly every character
>>in the world. 
>
>Forgive me if I miss the point of the original article; if I do it's because
>I don't HAVE the original article.  But I think Dan's point leads to an
>obvious conclusion; that the new header field should declare the
>CHARACTER set, not the language.  This is much more general.
>
There is one very bad thing about having a character set or a language header:
I am not going to support a nearly infinite set of conversion tables at my
site to be able to read articles! There is also a nearly infinite set of
possible names of character sets and languages. It will be impossible to
keep all the tables up to date. I can see no way we can manage to have the
tables available and consistent at every site in the world.

A single character set that covers all characters will make the special
headers unnecessary. Only tables between that character set and the few local
character sets at a site are needed at each site. There will be no confusion
about which character set or language the name in a header means, as the
character set will always be the same.
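
The size of the saving can be sketched with a little arithmetic.  The
counts below assume one table per direction between any two character sets;
the function names and the sample values of n are chosen only for
illustration.

```python
def pairwise_tables(n: int) -> int:
    """Tables needed if every set converts directly to every other set."""
    return n * (n - 1)


def pivot_tables(n: int) -> int:
    """Tables needed if every set converts only to and from one common set."""
    return 2 * n


for n in (3, 10, 50):
    print(n, pairwise_tables(n), pivot_tables(n))
# 3 6 6
# 10 90 20
# 50 2450 100
```

A single site with only a few local character sets needs just its own
handful of the pivot tables, whatever the rest of the net uses.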

   Dan




-- 
Dan Oscarsson                              Department of Computer Science
                                           Lund Institute of Technology
e-mail:  Dan@dna.lth.se                    Box 118
                                           S-221 00 Lund, Sweden

urlichs@smurf.sub.org (Matthias Urlichs) (01/01/91)

In news.software.b, article <1990Dec28.071825.28660@zoo.toronto.edu>,
  henry@zoo.toronto.edu (Henry Spencer) writes:
< In article <4ec122f7.1bc5b@pisa.ifs.umich.edu> rees@citi.umich.edu (Jim Rees) writes:
< >... Are modern B-news and C-news 8-bit safe?
< 
< We believe that C News is fully 8-bit clean, subject to two caveats:
< 
< 1. Some processing of certain things is done by shell files, using various
< 	ordinary Unix utilities, and if they are not 8-bit clean, C News
< 	can't be.
< 
That applies mainly to inews (which also strips control characters -- one
might argue that this is the news reader's job), so if the kernel is 8-bit
clean, the stuff might at least pass through OK.

NB: What, if anything, is wrong with using RFC 1049 for character sets and
another header (maybe call it "Content-Language"), structured similarly, for
language?
-- 
Matthias Urlichs -- urlichs@smurf.sub.org -- urlichs@smurf.ira.uka.de     /(o\
Humboldtstrasse 7 - 7500 Karlsruhe 1 - FRG -- +49+721+621127(0700-2330)   \o)/

brad@looking.on.ca (Brad Templeton) (01/01/91)

You can eliminate the need for a character set header, but you can't
eliminate the need for a language header.  Well, perhaps with very smart
software you could...

The language header is for use by readers and propagators.  It is to allow
people to quietly filter out postings in languages they can't read, or
to allow newsfeeds within one country to keep articles in local languages
(posted to global groups) within the confines of sites where such
languages are used.

As for character set, I agree that it's best to define one fairly
universal set, and mappings from it to other desired sets.

However, this is not all you have to define in a more multimedia system.

Right now the only option we have is ASCII plain text.  But there are
many other forms we might like.

For example, "rich text" with imbedded binary information, graphics, font
information, etc. is something we might like to use in a message.  Certainly
we want to extend past the "ASCII with underlining" we have now.

Another format is binary -- one that modern news feeders support but which
nobody has dared to use as yet.  (The readers don't support it well.)  Or
perhaps we want text+binary, where a message has introductory text, and then
a raw binary.

Sometimes we may want to identify the binary, if it's an image, for example.
The reader can then display it.   As far as I can see we only need to define
official headers for things that readers, filterers and feeding programs need
to know about.   Thus whether it's an a.out or .exe need not be in our
message-type header.

Anyway, here's a list of possibles:

	Plain ASCII with underlining (current default)
	Universal Character set
	Rich text -- ASCII
	Rich text -- Universal character set
	Andrew style rich text.
	Postscript
	Raw Binary
	Image binary (GIF,TIFF,?,standardize?)

If our rich text format is rich enough, we might consider merging all the
binary formats into it, except perhaps the raw binary format which is
not intended to be interpreted by the reader at all.   Otherwise all binary
formats should probably include a preface section in one of the text forms,
to allow for description beyond what can go in headers.

Text formats with natural language in them should of course have a language
header indicating the human language or languages.
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

spike@world.std.com (Joe Ilacqua) (01/01/91)

<Anyway, here's a list of possibles:
[...]

Let us not forget audio binaries.

->Spike
-- 
The World - Public Access Unix - +1 617-739-9753  24hrs {3,12,24,96,192}00bps

sef@kithrup.COM (Sean Eric Fagan) (01/01/91)

In article <1990Dec31.195124.1249@looking.on.ca> brad@looking.on.ca (Brad Templeton) writes:
>You can eliminate the need for a character set header, but you can't
>eliminate the need for a language header.  Well, perhaps with very smart
>software you could...

I think all the discussion that has arisen shows that there's more of a
need (or, at least, desire) for a Format:/Encoding: header field than a
Language: field.

-- 
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.

rad@railnet.UUCP (Rick DeMattia) (01/01/91)

root@questor.wimsey.bc.ca (Postmaster) writes:

> emv@ox.com (Ed Vielmetti) writes:
> 
> > In article <1990Dec22.081718.2109@looking.on.ca> brad@looking.on.ca (Brad T
> > 
> >    I propose a new USENET header item, namely "Language:"

> I agree with the proposal.  Waffle BBS Software (which we use at this
> site) already supports an 8-bit mode so that Cyrillic characters may be
> used, as well as diacritical marks on latin characters, such as e-acute,
> o-umlaut, and most others.
> 

Out of curiosity, how do you implement the aforementioned characters?  
IBM-PC character-set?  Is your method of implementation standard across 
platforms?

-- 
UUCP: {uunet|backbone}!nshore!railnet!rad       rad@railnet.uucp
CompuServe: 72517.666@compuserve.COM

keld@login.dkuug.dk (Keld J|rn Simonsen) (01/02/91)

My two cents worth on the Language: header:

There is an ISO standard ISO 639:1988 which contains all the languages
in the world with their English, French and native names - and
a standard two-letter abbreviation. I will suggest that the 
abbreviation should be used in this proposed header.

An excerpt of the ISO 639 standard is available by anon ftp in
dkuug.dk:i18n/ISO_639 - it has all abbreviations and the English
name listed, but no French nor native names.
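
A small sketch of how software might normalise the header to the two-letter
abbreviations, assuming it accepts either an English name or a code. The
table holds only a handful of entries for illustration, and `to_code` is a
hypothetical helper, not part of any news software.

```python
# A few ISO 639 two-letter codes keyed by English language name.
# Illustrative subset only; the standard lists many more.
ISO_639 = {
    "english": "en",
    "french": "fr",
    "german": "de",
    "italian": "it",
    "swedish": "sv",
    "danish": "da",
    "japanese": "ja",
}


def to_code(name):
    """Map an English language name or two-letter code to its code.

    Returns None when the value is not in the (partial) table.
    """
    key = name.strip().lower()
    if key in ISO_639.values():
        return key  # already a known two-letter code
    return ISO_639.get(key)
```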

Concerning the character set, there was a discussion on the
news early last year (1990!) - where Kim Storm (the author of nn)
announced that support for extended character sets was planned for nn.
I think he is still planning this, but the implementation is becoming
nearer.

Kim Storm planned on using some software I have developed. 
This SW can handle (in the current release) some 60 different character
sets and represent in each character set all of the others via some
quite mnemonic encoding. Thus you can have presented what your
equipment is capable of, and the rest can be deciphered too,
without loss of information. You can use an encoded ASCII as the
transport character set, this should pass any news agent in the world.
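
To illustrate the general idea of carrying other characters through a 7-bit
ASCII transport without loss, here is a minimal sketch. It is not the actual
scheme used by the software described above: the escape character, the
mnemonic table, and the hexadecimal fallback are all assumptions made for
the example.

```python
ESC = "&"  # assumed escape character for this sketch

# Assumed two-character mnemonics; a real table would be far larger.
MNEMONICS = {"ä": "a:", "ö": "o:", "å": "aa", "é": "e'", "ø": "o/"}
REVERSE = {code: char for char, code in MNEMONICS.items()}


def encode(text):
    """Rewrite text using ASCII only, reversibly."""
    out = []
    for ch in text:
        if ch == ESC:
            out.append(ESC + ESC)            # a literal escape is doubled
        elif ord(ch) < 128:
            out.append(ch)                   # plain ASCII passes through
        elif ch in MNEMONICS:
            out.append(ESC + MNEMONICS[ch])  # readable two-letter mnemonic
        else:
            out.append(f"{ESC}_{ord(ch):06x}")  # fallback: hex code point
    return "".join(out)


def decode(text):
    """Invert encode()."""
    out = []
    i = 0
    while i < len(text):
        if text[i] != ESC:
            out.append(text[i])
            i += 1
        elif text[i + 1] == ESC:
            out.append(ESC)
            i += 2
        elif text[i + 1] == "_":
            out.append(chr(int(text[i + 2:i + 8], 16)))
            i += 8
        else:
            out.append(REVERSE[text[i + 1:i + 3]])
            i += 3
    return "".join(out)
```

A terminal that cannot show a given character would still display the
mnemonic form, which a reader can decipher by eye.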

I am employing this SW at dkuug.dk - the Danish Internet mail
backbone and uucp gateway, via a sendmail 5.64 + IDA implementation.
I there have implemented the headers X-Charset: and X-Char-Esc:
to specify the encoding. I am also working together with Dan Oscarsson
on doing a truly ISO 10646 based sendmail.

The Sendmail and character set SW is available from dkuug.dk:pub

Keld Simonsen

davidg@aegis.or.jp (Dave McLane) (01/02/91)

rad@railnet.UUCP (Rick DeMattia) writes:

> > I agree with the proposal.  Waffle BBS Software (which we use at this
> > site) already supports an 8-bit mode so that Cyrillic characters may be
> > used, as well as diacritical marks on latin characters, such as e-acute,
> > o-umlaut, and most others.

I'm also using Tom Dell's Waffle ( I don't call it a BBS, though,
as that leads people to expect "menus and messages" instead of a
"prompt and articles") and am going to modify it (I signed up for
the C source code) to handle not only English and foreign vowels,
but Japanese kanji as well.

* Beware: There are places in Waffle V1.63 which strip the 8th bit!

Right now I'm running Waffle on a 386SX and have been trying to
get a copy of the UNIX version as I now have all the pieces for
an Interactive System V/386 system.

I've emailed Tom three times now at <dell@vox.darkside.com> but no reply. If
anybody (especially Tom) knows his whereabouts, would you let me know
here in alt.bbs (I don't have email to the internet worked out yet).

Thanks in advance,

--Dave McLane

==== The Aegis Society =============================================
Minami Hirao 1-6, Imazato                 The content and process of
Nagaokakyo-shi, Kyoto-fu, 617 Japan           international/cultural
Tel: +81-75-951-1168 Fax: +81-75-957-1087             communication.
====================================================================

matoh@sssab.se (Mats Ohrman) (01/02/91)

spike@world.std.com (Joe Ilacqua) writes:
>	As for multiple languages, list all the languages and dialects
>in the "Language:" header and use a font or encoding scheme that gives
>you the character sets you need.  Just because I post a message
>encoded in ASCII doesn't mean you can't follow up in JIS, so long as
>your posting server converts my text and sets the correct "Encoding:"
>header.

The problem is that there may not be any standard encoding for a specific
language. For example, an article written in Swedish may have any of
at least three different encodings, all being different "standards".
The language doesn't give you enough information to choose encodings, and
if it is the encoding you're after anyway, why not specify it in the header?
-- 
     _
Mats Ohrman     Scandinavian System Support AB       E-mail:   matoh@sssab.se
                Box 535    _                      Telephone:  +46 13 11 16 60
                581 06 Linkoping, Sweden            Telefax:  +46 13 11 51 93

root@questor.wimsey.bc.ca (Postmaster) (01/03/91)

rad@railnet.UUCP (Rick DeMattia) writes:

> Out of curiosity, how do you implement the aforementioned characters?  
> IBM-PC character-set?  Is your method of implementation standard across 
> platforms?

From the content of the following message, there appears to be a useful
implementation around the corner, and there is a good possibility that
it will indeed be standard across all platforms.  Maybe we should all get
rid of our PC's, Mac's, Amigas, et alia, and all get NeXT's??  :-)


> From: keld@login.dkuug.dk (Keld J|rn Simonsen)
> Newsgroups: news.software.b,alt.bbs
> Subject: Re: New USENET header: Language
> Message-ID: <keld.662759787@dkuugin>
> Date: 1 Jan 91 19:56:27 GMT
> References: <L9P1u1w163w@questor.wimsey.bc.ca> <DsZ3u2w163w@railnet.UUCP>
> Sender: news@slyrf.dkuug.dk
> Followup-To: news.software.b

> There is an ISO standard ISO 639:1988 which contains all the languages
> in the world with their English, French and native names - and
> a standard two-letter abbreviation. I will suggest that the 
> abbreviation should be used in this proposed header.

> An excerpt of the ISO 639 standard is available by anon ftp in
> dkuug.dk:i18n/ISO_639 - it has all abbreviations and the English
> name listed, but no French nor native names.

> Concerning the character set, there was a discussion on the
> news early last year (1990!) - where Kim Storm (the author of nn)
> announced that support for extended character sets was planned for nn.
> I think he is still planning this, but the implementation is becoming
> nearer.

> Kim Storm planned on using some software I have developed. 
> This SW can handle (in the current release) some 60 different character
> sets and represent in each character set all of the others via some
> quite mnemonic encoding. Thus you can have presented what your
> equipment is capable of, and the rest can be deciphered too,
> without loss of information. You can use an encoded ASCII as the
> transport character set, this should pass any news agent in the world.

> I am employing this SW at dkuug.dk - the Danish Internet mail
> backbone and uucp gateway, via a sendmail 5.64 + IDA implementation.
> I there have implemented the headers X-Charset: and X-Char-Esc:
> to specify the encoding. I am also working together with Dan Oscarsson
> on doing a truly ISO 10646 based sendmail.

> The Sendmail and character set SW is available from dkuug.dk:pub

> Keld Simonsen

Many thanks to Keld for his timely response!!

---
   Steve Pershing, System Administrator

| The QUESTOR PROJECT - Free Usenet News/Internet Mail; Sci, Med, AIDS, more |
|----------------------------------------------------------------------------|
| Usenet:  sp@questor.wimsey.bc.ca      |  POST: 1027 Davie Street,  Box 486 |
| Phones:  Voice/FAX:  +1 604 682-6659  |        Vancouver, British Columbia |
|          Data/BBS:   +1 604 681-0670  |        Canada  V6E 4L2             |

spike@world.std.com (Joe Ilacqua) (01/03/91)

In article <kd1oq1eo7C@herkules.sssab.se> matoh@sssab.se (Mats Ohrman) writes:
<spike@world.std.com (Joe Ilacqua) writes:
<>	As for multiple languages, list all the languages and dialects
<>in the "Language:" header and use a font or encoding scheme that gives
<>you the character sets you need.  Just because I post a message
<>encoded in ASCII doesn't mean you can't follow up in JIS, so long as
<>your posting server converts my text and sets the correct "Encoding:"
<>header.
<The problem is that there may not be any standard encoding for a specific
<language. For example, an article written in Swedish may have any of
<at least three different encodings, all being different "standards".
<The language doesn't give you enough information to choose encodings, and
<if it is the encoding you're after anyway, why not specify it in the header?

	The way I look at it "Language:" is to allow humans to decide
if they can and want to read a message.  "Encoding:" specifies how
that language is encoded.  The fact that there are three ways to
encode Swedish does not change the fact that the language is
Swedish.
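
A short sketch of why the two headers carry independent information: the
same Swedish text yields different bytes under different encodings, and
decoding with the wrong one garbles it. The three codecs below are stand-ins
chosen because Python ships them; they are an assumption, not necessarily
the three "standards" meant in the quoted article.

```python
TEXT = "Linköping, Västerås"

# Stand-in encodings (assumed for illustration).
CANDIDATES = ("latin-1", "cp437", "mac_roman")


def encode_all(text, encodings=CANDIDATES):
    """Encode the same text under each named encoding."""
    return {name: text.encode(name) for name in encodings}


def decode_body(body, encoding):
    """Decode an article body using the encoding its header declares."""
    return body.decode(encoding)


if __name__ == "__main__":
    for name, body in encode_all(TEXT).items():
        # Same language, different bytes on the wire.
        print(f"{name:10} {body.hex(' ')}")
```

The language stays Swedish in every case; only an "Encoding:" value tells
the reader which table to decode with.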

->Spike
-- 
The World - Public Access Unix - +1 617-739-9753  24hrs {3,12,24,96,192}00bps

rees@pisa.ifs.umich.edu (Jim Rees) (01/03/91)

In article <1990Dec30.095938.23011@lth.se>, Dan@dna.lth.se (Dan Oscarsson) writes:

  ISO 10646 defines sequences for switching character length and subset of
  the total set. This allows letters using ASCII or ISO 8859-1 to be sent
  exactly as today with not one character more. ISO 10646 can use 8,16,24 and
  32 bits per character and can dynamically change between them so the
  standard defines a way to efficiently send text.

I'll have to plead guilty of getting us off the track.  Brad just wanted
some way to skip articles written in French because he doesn't read French.
I now understand that this is orthogonal to the issue of character sets.

I think we should:

1.  Introduce a "Language:" header.  This wouldn't affect how your news
reader displays text.  You could put "Language: French" in your kill file if
you want.

2.  Adopt a standard for character set encoding.  This should be easy, since
there are so many to choose from.  ISO 10646 looks good to me, if it isn't
too hard to implement.

3.  For those of us with X terminals, modify the Athena text widget so that
it understands and can display the encoding selected in (2) above.  This
widget could then be plugged in to xrn or your favorite X-based news reader.
Could be plugged in to your mailer, too.  Unfortunately xterm doesn't use
the text widget.  Maybe someone already has an ISO 10646 widget -- anyone
know?

4.  People with ASCII terminals will need a reader that transliterates to
ASCII using some kind of table.  I'm not going to worry about this.

5.  I don't know what to do about input.  I assume there are standards for
this too.

6.  Per Brad's suggestion, we should start sending binary stuff as 8 bit
binary.  News propagates it just fine.  It's a horrible waste of bandwidth to
uuencode all that stuff.  We'll want to fix our readers/posters to deal with
this.
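
To put a rough number on the bandwidth cost mentioned in point 6, here is a
sketch that measures the size of a uuencoded body using Python's
`binascii.b2a_uu`. The helper name is hypothetical, and the "begin"/"end"
framing lines are left out of the count.

```python
import binascii


def uuencoded_size(data):
    """Bytes needed to carry `data` as uuencoded lines (no begin/end lines).

    uuencode packs at most 45 input bytes per line: one length character,
    up to 60 encoded characters, and a newline.
    """
    total = 0
    for i in range(0, len(data), 45):
        total += len(binascii.b2a_uu(data[i:i + 45]))
    return total


if __name__ == "__main__":
    raw = bytes(4500)
    sent = uuencoded_size(raw)
    print(f"{len(raw)} raw bytes -> {sent} uuencoded bytes "
          f"({sent / len(raw):.2f}x)")
```

Each full line turns 45 bytes into 62, so the encoded body is roughly 38%
larger than the raw binary.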

brad@looking.on.ca (Brad Templeton) (01/03/91)

In article <4ef9aa39.1bc5b@pisa.ifs.umich.edu> rees@citi.umich.edu (Jim Rees) writes:
>I'll have to plead guilty of getting us off the track.  Brad just wanted
>some way to skip articles written in French because he doesn't read French.
>I now understand that this is orthogonal to the issue of character sets.

Well, I can read French ... mostly, and certainly not as well as I could
when I left school, however it does take a lot more work so I might still
want to skip such articles.  I can't read Italian, however, and the group
trial.soc.culture.italian is full of postings in that language.


However, I would be upset if my suggestion of a human-language header
caused the creation of an encoding header that supported large numbers
of character set encodings.

If we allow multiple encodings we only complicate all the software, or
create a set of postings that can't be read very widely by the computers,
not just the humans.

One or two official encodings beyond ASCII, at best, are all we should
dare.

In fact, I would try to kill many birds with one stone and say that
the extra encoding should be a rich text format that supports, aside
from unusual characters, formatting codes etc.   Perhaps even the
Microsoft RTF with international charset extensions as we wish.

If somebody writes a nice decoding pager for this format, then
all newsreaders can use that at the very least for such files, even
if it's slow to call a remote program.  We don't want to generate a
large set of postings that lots of people can't even decode because
their software doesn't support 10 otherwise isomorphic encoding methods.

In summary, multiple encoding methods should be defined only when they
provide new features.  There is no point in supporting two ways of doing
the same thing.
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

harald.alvestrand@elab-runit.sintef.no (01/04/91)

Just to get everybody stirred up:
It might be a good idea to look at what other people are doing.

The X.400(88) standards specify a header-extension (header to all you folks)
that is a set of languages, each identified by its ISO 639 two-letter code,
that "identifies the languages used in the composition of the message's subject
field and body". "If the extension is absent, the languages shall be considered
unspecified", to quote the legalese.

They also specify that the body of the message contains a sequence of
body parts, each of which may have a different encoding. Definition of body
parts is a "free-for-all", with IA5 text (ASCII) being far and away the most
common.
(Anybody can now allocate an identifier for a body part)
My personal guess for the winner in the "rich text" sweepstakes is ODA, also
an ISO standard.

If we want to crash all our newsreaders, we might as well aim at something
that *at least* covers this functionality :-)

(I did NOT put a followup-to: to comp.protocols.iso.x400 - but further
discussion of the X.400(88) standard may belong there. Impact on News belongs
here, I think...)


                   Harald Tveit Alvestrand
Harald.Alvestrand@elab-runit.sintef.no
C=no;PRMD=uninett;O=sintef;OU=elab-runit;S=alvestrand;G=harald
+47 7 59 70 94

eps@toaster.SFSU.EDU (Eric P. Scott) (01/05/91)

In article <1991Jan4.082150.5895@ugle.unit.no>
	harald.alvestrand@elab-runit.sintef.no writes:
>My personal guess for the winner in the "rich text" sweepstakes is ODA, also
>an ISO standard.

Would someone please comment on the reference in RFC 1197?

					-=EPS=-

anoosh@mips.COM (Anoosh Hosseini) (01/05/91)

It may be worthwhile to step back, and look at the
global picture. What is the mission?  Internationalized News readers? OK. Let's
assume this is the goal.  In my opinion, we may want to look at the problem at
3 layers.  The bottom layer is pure data, here we all agree that we need to
move to an environment where 8 bit data can be created, and sent  
smoothly across all systems. This will allow any type of encoding, including
multi-byte encodings.  The next layer is transportation and management of 
articles. This already exists and the only thing worth mentioning is that
we agree that all header format/information remain in ASCII English.
The top layer is presentation and the area needing most rework.  So far we have 
stayed with the ASCII display terminal as the common denominator.  

In this new environment, articles will no longer just be in French, German, 
or Italian, but some really exotic ones which require much more sophisticated
support. So two trains of thought come to mind, one for the existing ASCII
News readers to tolerate articles which may have non-ASCII message bodies,
and standard methodologies to describe encodings used and resources
needed to post/display multi-lingual articles. 

The new ASCII News reader will need to determine from the additional header
field whether the encoding used within the body is one it does not support. 
In such a case, it may wish to skip over the article.  On the other hand a 
News reader supporting  ISO 10646 encoding does not mean it can display every 
character set encapsulated within the standard. It would be nice to have a 
complete solution, but that may be a ways off.  Due to the research and
background required to implement interfaces for some of these languages,
most internationalization efforts have been localized. That is, we may have
a Japanese interface, but it may not do Hebrew or Arabic and vice versa.
So we will need notation to specify subsets of standards.  ISO 10646 
encompasses many other ISO standards such as the 8859-X series.  The latter 
are single byte (8 bit) codes and may be all that is needed for those who 
will at most use 2 languages (English/X). Just specifying 8859-X may be easier
since the presentation layer of that particular News reader will not display 
any other character sets. In any case, anyone hoping to get any attention will 
need to encode in formats that people are reading in. :-)
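
A minimal sketch of the check a reader might make before displaying an
article, assuming the encoding is declared in a hypothetical "Encoding:"
header (the name used elsewhere in this thread) and defaults to ASCII when
absent. Python's codec registry stands in for whatever the reader's
presentation layer can actually draw.

```python
import codecs
import email


def canonical(name):
    """Return the codec registry's canonical name, or None if unknown."""
    try:
        return codecs.lookup(name.strip()).name
    except LookupError:
        return None


# What this particular (imaginary) reader can display: ASCII plus one
# 8859-X subset, as in the English/X case above.
DISPLAYABLE = {canonical("ascii"), canonical("iso-8859-1")}


def can_display(raw_article, displayable=DISPLAYABLE):
    """True when the declared encoding is one this reader can show."""
    msg = email.message_from_string(raw_article)
    declared = canonical(msg.get("Encoding", "ascii"))
    return declared is not None and declared in displayable
```

A reader built this way skips an article in an encoding outside its subset
rather than filling the screen with garbage.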

We have found Xrn to have a good abstraction model, allowing easy foreign
language support.  The displaying of message body is passed to an X11 entity
(text widget) which is responsible for all screen management and editing. 
The interface is basically a string,  and the text widget can hide all the 
encoding, and display management, leaving the Xrn News reader basically 
intact.


-anoosh

--