[comp.text] Polyglot List Issue

koontz@cam.nist.gov (John E. Koontz X5180) (01/15/91)

Since I cross posted some comp.text contributions to the Polyglot list
which inspired replies, I am posting the replies and some supporting 
material here for the benefit of the original posters.

Date: Fri, 11 Jan 91 13:57:31 -0600
To: Polyglot@tira.uchicago.edu
From: Polyglot-request@tira.uchicago.edu
Subject:  Polyglot Digest V2 #2

--------
__________________________ P O L Y G L O T _________________________

    POLYGLOT --  A Mailing List Devoted to Multilingual Computing
	   The Center for Information and Language Studies

Contributions to: 			  polyglot@tira.uchicago.edu
Administrative requests to:	  polyglot-request@tira.uchicago.edu
Anonymous ftp archive:			  tira.uchicago.edu:polyglot
____________________________________________________________________
Polyglot Digest                           Friday, 11 Jan 1991
                      Volume 2 : Issue 2
	
Today's Topics:

                             Administrivia
                        Unicode Progress Report
            International character set requirements needed
                       GNU Emacs and 8-bit text
                             smtp interest
                     8-bit cleaning, Unicode, etc.

------------------------------------------------------------

Date:    Fri, 11 Jan 91 13:45:01 -0600
From:    scott@sage.uchicago.edu (Scott Deerwester)
Subject: Administrivia

Well!  It appears that the only problem with Polyglot was that most
people forgot about it!  The distribution of issue 2/1 prompted a
number of responses to the international character set requirements
discussion, which form the bulk of issue 2.  Two announcements
complete the issue.  First, John Koontz forwards a Unicode progress
report from Asmus Freytag.  The final article is an announcement about
work that der Mouse is doing on GNU emacs and 8-bit characters.

Enjoy!  And *please* contribute!  Submissions from various news groups
are welcome.

	Scott Deerwester
	Center for Information and Language Studies
	University of Chicago

...

------------------------------

Date:    Thu, 10 Jan 91 11:53:37 -0800
From:    Tom McFarland <tommc@hpcvlx.cv.hp.com>
Subject: International character set requirements needed

Scott,

You might want to put out a correction to V2 #1.  Both Unicode and ISO
10646 are useful encodings... however, they are not similar or closely
related as various posters in V2#1 indicate.

Tom McFarland
Hewlett-Packard, Co.
Interface Technology Operation
Internationalization Team
<tommc@cv.hp.com>


>From: tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill)
>Newsgroups: comp.text
>
>keld@login.dkuug.dk (Keld J|rn Simonsen) writes:
>
>This is what Unicode is for.  Unicode should be considered the most
>useful and implementable subset of the draft standard ISO 10646.

Unicode can by no stretch of the imagination be considered a subset
(either proper or improper) of ISO 10646.  While both attempt to
address the same objective, their similarities end there.  ISO 10646
is a standard being developed by official national representatives;
Unicode is a grass roots based, competing code set being proposed by a
group of vendors.

> ...  The reason 16 bits are enough is that Asian pictographs which
>everyone would recognize as the same have been unified.  Thus, more
>than 31,000 characters have been reduced to about 20,000 slots.

Not everyone recognizes this.  In fact, enough people disagreed with
this to vote down exactly this proposed change in the ISO group
drafting 10646.  As I remember it, Japan was the major opponent of
this modification.


>------------------------------
>
>Date:    Fri, 04 Jan 91 09:44:29 -0700
>From:    koontz@alpha.bldr.nist.gov (John E. Koontz)
>Subject: International character set requirements needed
>
>Forwarded message follows:
>
>From: tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill)
>Newsgroups: comp.text
>
>keld@login.dkuug.dk (Keld J|rn Simonsen) writes:
>>
>> Is UNICODE a true subset of ISO 10646?
>> Is there a well defined relation between ISO 10646 encoding and UNICODE?
>
>ISO 10646 is still in draft form.  Both questions are impossible to
>answer until 10646 gets finalized.

The draft is fairly stable, and the questions are not that difficult
to answer.  UNICODE is not a true subset of ISO 10646 - the two
encoding methods are similar only in their attempt to address the same
problem set.  As for there being a well defined relation between 10646
and Unicode... the author answers his own question in the trailing
paragraphs: there is not a one-to-one mapping and data may be lost
converting between the two.

>Disclaimer: I'm not an expert in this area.
>However, extrapolating from what I know, it appears that Unicode
>could be considered a 16-bit implementation of 10646.  The ISO 10646
>draft standard appears to permit 16-bit implementations of any subset
>thereof, for use in process code or communication.

ISO 10646 is very specific in the forms of use allowed.  One key
difference that comes to mind is that ISO prohibits assigning
characters to row/column/plane/group values in the range 0x00-0x20,
0x7f-0xA0, and 0xff.  ISO has done this in an attempt to maintain some
level of backwards compatibility with hardware/software that recognize
these values as control codes.  Unicode actively uses these values to
achieve its compactness.
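
A minimal Python sketch of the restriction described above, assuming
only the octet ranges quoted in this message (they have not been
checked against the draft text).  The function names are illustrative
and do not come from either standard.

```python
# Octet values described above as off-limits in every position of an
# ISO 10646 code: 0x00-0x20, 0x7F-0xA0, and 0xFF.  These ranges are
# taken from the message, not verified against the draft.
def octet_is_reserved(octet: int) -> bool:
    return octet <= 0x20 or 0x7F <= octet <= 0xA0 or octet == 0xFF


def code_is_allowed(group: int, plane: int, row: int, column: int) -> bool:
    """True if none of the four octets falls in a reserved range."""
    return not any(octet_is_reserved(o) for o in (group, plane, row, column))


def usable_octets() -> int:
    """How many of the 256 octet values remain usable in each position."""
    return sum(1 for o in range(256) if not octet_is_reserved(o))
```

Under these assumptions each octet keeps 188 of its 256 values, so two
octets give 188 * 188 = 35,344 combinations rather than 65,536.  That
is the compactness trade-off referred to above.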

>It just so happens that Unicode covers all Asian characters
>enumerated by existing national standards, plus characters from
>languages that the 10646 draft hasn't even thought about.  So it may
>be a subset, but a largely complete subset.

>There have been attempts to convert Unicode to 10646 and back again,
>I believe with mostly good results.  Of course, some data may be lost
>in the translation.

------------------------------

Date:    Thu, 10 Jan 91 21:04:56 -0500
From:    der Mouse <mouse@lightning.McRCIM.McGill.EDU>
Subject: GNU Emacs and 8-bit text

I've been sort of wondering about a good place to mention this, and
today's polyglot digest reminded me of its existence :-)

I have extended the display support in GNU emacs 18.55.95 to support
display of 8-bit text.  (I have offered my changes to Stallman, but he
tells me that version 19 already addresses the problem, and he'd
rather work on getting 19 out than on updating 18.*.)  The changes can
actually be used for other things as well, as you'll see from the
description below....

The changes eliminate the ctl-arrow variable and create two new
functions:

set-chardisp

  Set the way a character displays in the current buffer (or set the
  default).  The first argument is the character whose display is to
  be set, or nil; the second is the string it is to display as, or
  nil.  (Each character of this string is assumed to occupy one screen
  position.)  If the third argument is omitted or is nil, the current
  buffer's display is set; if it's a buffer, that buffer's display is
  set; otherwise, the default display is set.  If the character is
  nil, all 256 entries of the table are set; if the string is nil, the
  display is set to the default (for a buffer, it uses the default
  value; for the default, the built-in default display is restored).
  Passing nil as both of the first two arguments works sensibly.

get-chardisp

  Get the way a character displays in the current buffer (or the
  default).  The first argument is the character whose display string is
  to be returned, or nil.  If the second argument is omitted or is nil,
  the current buffer's display is returned; if it's a buffer, that
  buffer's display is returned; otherwise, the default display string
  (used for buffers that haven't specifically set a string, or for
  contexts where no buffer is readily available) is returned.  If the
  character is non-nil, that character's display string is returned; if
  not, a 256-element vector is returned, listing all the display strings
  for the buffer (or default) requested.  The returned value is always a
  copy; modifying it will not affect the display.  Use set-chardisp to
  change the display.

and two new variables:

default-special-tab-display

  Default special-tab-display for buffers that do not override it.
  This is the same as (default-value 'special-tab-display).

special-tab-display

  Display tabs by moving to tab stops (as opposed to displaying as
  control-I).  Non-nil means to display by tabbing; nil means to
  display tabs as if they were any other control character.
  Automatically becomes local when set in any fashion.

(The idea is that if your display device can display 8-bit text
directly, you use set-chardisp to set each of the high-half characters
to display as itself (ie, as a one-character string); if not, you can
do things like making e-acute display as <e'> and o-slash as <o/>.)
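
The semantics described above can be modelled outside Emacs.  What
follows is a rough Python sketch, not der Mouse's code: display tables
are plain 256-element lists, the choice between buffer and default is
reduced to passing the table explicitly, and builtin_display is an
assumed stand-in for the editor's built-in rendering.

```python
def builtin_display(code: int) -> str:
    """Assumed built-in rendering: ^X for controls, \\nnn octal for 8-bit."""
    if code < 0x20 or code == 0x7F:
        return "^" + chr(code ^ 0x40)
    if code >= 0x80:
        return "\\%03o" % code
    return chr(code)


# The "default" table that buffers fall back to.
default_table = [builtin_display(c) for c in range(256)]


def set_chardisp(table, char, string, fallback):
    """Set one entry, or all 256 when char is None.

    When string is None the entry is restored from the fallback table.
    """
    codes = range(256) if char is None else [char]
    for code in codes:
        table[code] = fallback[code] if string is None else string


def get_chardisp(table, char=None):
    """Return one display string, or a copy of the whole table."""
    return list(table) if char is None else table[char]
```

With a buffer table created as list(default_table), calling
set_chardisp(buf, 0xE9, "<e'>", default_table) makes e-acute display as
<e'>, and passing None as the string restores the default for that
character.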

The diffs are under 24K.  I have not yet gotten around to doing the
corresponding things to the input code.  I can mail the diffs, put
them up for anonymous ftp, or even post them somewhere.  Let me know
what you think should be done (unless, of course, you don't care at
all :-).  I also have not written any lisp code to use the new
primitives.

					der Mouse

			old: mcgill-vision!mouse
			new: mouse@larry.mcrcim.mcgill.edu

------------------------------

Date:    Fri, 11 Jan 91 10:07:12 +0000
From:    Glenn.Wright@UK.Sun.COM (Glenn Wright - Sun EHQ - Mktg)
Subject: smtp interest

Keld,

I noted your posting to polyglot, re: sendmail issues.  Were you aware
that the IETF (Internet Engineering Task Force) is currently studying
the means by which non-ASCII mail can be sent using the SMTP protocol?
I believe they are working on an extended SMTP form (ESMTP). I wonder
if this will be successful?

Glenn Wright,
Sun Microsystems.

Here is some information on the group:

IETF
----

The goals of the group are the following:

  o Incorporate compatibility with the new host requirements document.

  o Allow binary data in message bodies and remove line length restrictions

  o Allow command pipelining (batched smtp)

  o enhance the maintainability/management of mail systems

  o Draft a management information base for use by network management
    systems.

  o and perhaps expand the header alphabet somewhat.

Things the group does not intend to do:

  o attempt to mimic the functionality of X.400

  o produce a major re-write of the rfc821/822 mail format

  o make changes to the header structure.



Our strategy is to develop an update of rfc1154 (content-type header) to
better meet the needs of having multiple character sets and encodings.
Basically we want to separate the notion of content type from
the encoding of that content type.  This should allow gateways between
binary and non-binary capable mail systems to make intelligent choices
about encoding data.  We'll also most likely formalize the
content-length header field.  We would then encourage people to
publish documents (rfcs) describing the data types and encodings which
they wish to use.

Members of the group will be working on formalization of the Text-Hex
encoding scheme.  This encoding scheme allows for the representation
of 8-bit characters as an ASCII escape sequence.  This should allow a
variety of additional character types in mail headers without the need
for changing the header specifications.  This encoding could also be
used on the bodies of messages that are "mostly" in 7-bit character
sets.  Several European languages fall into this area.
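
The general idea of representing 8-bit characters as ASCII escape
sequences can be sketched in Python.  The actual Text-Hex syntax is not
given here, so the escape form below (a backslash followed by two hex
digits) is purely an assumption chosen for illustration, as are the
function names.

```python
ESC = "\\"  # hypothetical escape character; the real scheme may differ


def hex_escape(data: bytes) -> str:
    """Turn 8-bit data into 7-bit text, escaping high bytes and ESC itself."""
    parts = []
    for byte in data:
        if byte >= 0x80 or byte == ord(ESC):
            parts.append("%s%02X" % (ESC, byte))
        else:
            parts.append(chr(byte))
    return "".join(parts)


def hex_unescape(text: str) -> bytes:
    """Reverse hex_escape; assumes well-formed 7-bit input."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == ESC:
            out.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)
```

Text that is mostly 7-bit stays readable under a scheme like this: a
Latin-1 encoded name containing a single o-slash (byte 0xF8) grows by
only two characters.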

------------------------------

Date:    Fri, 11 Jan 91 17:26:45 +0100
From:    macrakis@gr.osf.org
Subject: 8-bit cleaning, Unicode, etc.

A few comments on the international character set discussion:

1) Converting existing 7-bit programs or protocols to be 8-bit clean
   is almost always `trivial', in some sense, but it does require a
   non-negligible amount of work and even thought.  If one subroutine
   uses the top bit to mark some property or uses a particular
   character as a sentinel, it has to be first identified and then
   fixed.  And then some other way has to be found to represent the
   character properties or string length--this may have secondary
   effects in many places.

   I'm thinking of both the troff and the sendmail/SMTP discussion.

2) troff, the Unix variant of the CTSS runoff program (1963!!) is
   ancient technology, and making it 8-bit clean strikes me as a
   rear-guard action.

3) The OSF/1 system is 8-bit clean (it may even have a clean troff).

4) GNU Emacs, unlike vi (which was a notorious user of the 8th bit, in
   fact), has <<always>> been 8-bit clean.  In fact, you can use it to
   edit binary files!  On the other hand, only the latest versions
   allow you to type in and display 8-bit graphic characters.

5) There does NOT appear to be a clean, standard, and reasonable way
   to specify which character set you're using in a given file nor any
   way to switch character sets within a file.

6) Latin-1 does indeed cover all of Western Europe, but does <<not>>
   cover Greek, and therefore does not cover all the EEC.

7) The ISO 646 alternate national characters are handy for unilingual
   environments, but are a disaster for multilingual environments.

8) Unicode seems very nice.  Characters are a fixed 16 bits, which
   greatly simplifies processing.  However, there is the notion of
   diacritical marks (accents, vowel points, etc.), any number of
   which may follow a base character.  Messier to handle (I do not
   have the full spec so I don't know how it's done) are the several
   double diacritics (modifying two characters at a time).  Of course,
   most programs do not care.

9) I have not seen ISO 10646, but it seems crazy to go from a
   fixed-width 16-bit character set to a variable-width character set
   just to represent the same Chinese character multiple times.
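
Point 1 can be illustrated with a small Python sketch.  It assumes a
routine that clears the top bit of every byte, which stands in for any
code that borrows that bit for its own purposes.  The helper names are
made up for the example.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Clear bit 7 of every byte, as code that reuses that bit would."""
    return bytes(b & 0x7F for b in data)


def is_8bit_clean(transform, sample: bytes) -> bool:
    """A transform is 8-bit clean for a sample if it leaves it unchanged."""
    return transform(sample) == sample


# "deja vu" with e-acute (0xE9) and a-grave (0xE0), encoded as Latin-1.
sample = "d\u00e9j\u00e0 vu".encode("latin-1")
damaged = strip_high_bit(sample)
```

Pure ASCII passes through such a routine untouched, which is why the
problem tends to go unnoticed until accented text turns up: here the
two accented letters silently become "i" and "`".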

------------------------------

End of Polyglot Digest
**********************

gisle@ifi.uio.no (Gisle Hannemyr) (01/15/91)

>  Date:    Fri, 11 Jan 91 17:26:45 +0100
>  From:    macrakis@gr.osf.org
>  Subject: 8-bit cleaning, Unicode, etc.
>
>   6) Latin-1 does indeed cover all of Western Europe, but does <<not>>
>     cover Greek, and therefore does not cover all the EEC.

Latin-1 (ISO 8859/1) does NOT cover "lappish", which is a minority language
used by the Lapps in Norway, Sweden and Finland.  As the Nordic countries
are usually considered part of Western Europe, Latin-1 does not cover all
of Western Europe.

Lappish is covered by Latin-4 (ISO 8859/4).


--
Disclaimer: The opinions expressed herein are not necessarily those of my
            employer, not necessarily mine, and probably not necessary.
 
- gisle hannemyr  (Norwegian Computing Center)
  EAN:   C=no;PRMD=uninett;O=nr;S=Hannemyr;G=Gisle (X.400 SA format)
         gisle.hannemyr@nr.no                      (RFC-822  format)
  Inet:  gisle@ifi.uio.no
  UUCP:  ...!mcsun!ifi!gisle
------------------------------------------------

yfcw14@castle.ed.ac.uk (K P Donnelly) (01/15/91)

>>   6) Latin-1 does indeed cover all of Western Europe, but does <<not>>
>>     cover Greek, and therefore does not cover all the EEC.

>Latin-1 (ISO 8859/1) does NOT cover "lappish", which is a minority language
>used by the Lapps in Norway, Sweden and Finland.  As the Nordic countries
>are usually considered part of Western Europe, Latin-1 does not cover all
>of Western Europe.

>Lappish is covered by Latin-4 (ISO 8859/4).

Latin-1 does not cover Welsh, which has something like 300 thousand or
400 thousand speakers.  Wales is both in the EEC and in Western Europe.

In fact Welsh is not covered by *any* of the parts of ISO 8859.  The
problem is that Welsh has accents not only on the vowels  a e i o u
but also on the semivowels  w  and  y, and some of them are important.

Welsh is included in the mechanism of ISO 6937, which is based on
Teletext, allowed in X.400 headers, and which uses non-spacing
"floating" accents to code accented characters in two bytes (often
making processing difficult).  However, a proposal to amend the
ISO 6937 *repertoire* to include Welsh was recently voted down.
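
The coverage claim can be checked mechanically.  The Python sketch
below tries to encode w-circumflex and y-circumflex in ISO 8859 parts 1
through 9.  Limiting the check to those nine parts is an assumption
about which parts were current at the time; later parts are
deliberately left out.  The helper name is illustrative.

```python
# Python codec names for ISO 8859 parts 1 through 9.
ISO_8859_PARTS = ["iso8859_%d" % n for n in range(1, 10)]


def parts_covering(text: str) -> list:
    """Return the ISO 8859 parts (1-9) that can encode every character."""
    covering = []
    for codec in ISO_8859_PARTS:
        try:
            text.encode(codec)
        except UnicodeEncodeError:
            continue
        covering.append(codec)
    return covering


# U+0175 is w with circumflex, U+0177 is y with circumflex.
welsh_letters = "\u0175\u0177"
```

parts_covering(welsh_letters) comes back empty, whereas a word using
only e-acute is accepted by Latin-1 among others.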

   Kevin Donnelly

Philippe.Deschamp@Nuri.INRIA.Fr (01/18/91)

In article <6600@alpha.cam.nist.gov>, koontz@cam.nist.gov (John E.
Koontz X5180) writes:
|> From:    macrakis@gr.osf.org
|> Subject: 8-bit cleaning, Unicode, etc.
...
|> 6) Latin-1 does indeed cover all of Western Europe, but does <<not>>
|>    cover Greek, and therefore does not cover all the EEC.

In article <GISLE.91Jan14232159@kyrre.uio.no>, gisle@ifi.uio.no (Gisle
Hannemyr) replies:
|> Latin-1 (ISO 8859/1) does NOT cover "lappish", which is a minority language
|> used by the Lapps in Norway, Sweden and Finland.  As the Nordic countries
|> are usually considered part of Western Europe, Latin-1 does not cover all
|> of Western Europe.

In article <7828@castle.ed.ac.uk>, yfcw14@castle.ed.ac.uk (K P Donnelly) adds:
|> Latin-1 does not cover Welsh, which has something like 300 thousand or
|> 400 thousand speakers.  Wales is both in the EEC and in Western Europe.

   Latin-1 does not cover the French language, which is used in France
and other countries (maybe mainly in other countries :-).  It lacks
the "oe" ligatures (\oe and \OE of TeX).  Last time I looked at a map,
France was in Western Europe...
					Philippe Deschamp.
Tlx: 697033F   Fax: +33 (1) 39-63-53-30   Tel: +33 (1) 39-63-58-58
Email: Philippe.Deschamp@Nuri.INRIA.Fr   ||   ...!inria!deschamp
Smail: INRIA, Rocquencourt, BP 105, 78153 Le Chesnay Cedex, France

tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill) (01/19/91)

koontz@cam.nist.gov (John E. Koontz X5180) forwards POLYGLOT:
			(Please forward this to POLYGLOT.)

> ------------------------------
> From:    Tom McFarland <tommc@hpcvlx.cv.hp.com>
> 
> ISO 10646 is a standard being developed by official national representatives;
> Unicode is a grass roots based, competing code set being proposed by a
> group of vendors.

Proposals are currently underway to make Unicode one of the accepted
code planes in 10646, and to create a 10646U (U for Unicode) compaction
form using only 16 bits, rather than the 32 bits required by 10646.
The Unicode Consortium has no wish to compete with ISO 10646, and would
prefer to work with ISO toward a truly useful standard.

> there is not a one-to-one mapping and data may be lost
> converting between the two.

Mainly for the reason that Unicode includes many languages that 10646
does not represent.  Both Unicode and 10646 fully represent all ISO 8859
and all existing Chinese/Japanese/Korean national standards.

I believe that C/J/K unification is the right thing to do.  Consider
what the world would be like if English-speaking people insisted on
having their own A-Za-z alphabet, separate from Spanish A-Za-z.  This
is exactly what East Asian countries are doing.

> ISO 10646 is very specific in the forms of use allowed.  One key
> difference that comes to mind is that ISO prohibits assigning
> characters to row/column/plane/group values in the range 0x00-0x20,
> 0x7f-0xA0, and 0xff.  ISO has done this in an attempt to maintain some
> level of backwards compatibility with hardware/software that recognize
> these values as control codes.  Unicode actively uses these values to
> achieve its compactness.

Unicode does leave empty slots for ASCII and ISO 8859 control codes.
Is that sufficient?  I don't understand the purpose of leaving any
more empty slots than those.  Perhaps someone knowledgeable about ISO
10646 could enlighten me.

ath@linkoping.telesoft.se (Anders Thulin) (01/27/91)

In article <1840@seti.inria.fr> Philippe.Deschamp@Nuri.INRIA.Fr writes:

>   Latin-1 does not cover the French language, which is used in France
>and other countries (maybe mainly in other countries :-).  It lacks
>the "oe" ligatures (\oe and \OE of TeX).  Last time I looked at a map,
>France was in Western Europe...

Considering that the OE ligature isn't used in *any* of the 8859/1-8
tables, I can't help wondering if it really is an important character.

Perhaps it's a plot to force France out of Western Europe :-)


-- 
Anders Thulin       ath@linkoping.telesoft.se 
Telesoft Europe AB, Teknikringen 2B, S-583 30 Linkoping, Sweden

jaap@mtxinu.COM (Jaap Akkerhuis) (01/29/91)

In article <4947@srava.sra.co.jp> erik@srava.sra.co.jp (Erik M. van der Poel) writes:
 > Bill Tuthill writes:
 > > I believe that C/J/K unification is the right thing to do.  Consider
 > > what the world would be like if English-speaking people insisted on
 > > having their own A-Za-z alphabet, separate from Spanish A-Za-z.
 > 
 > I don't think that is a very good analogy. [stuff deleted]

Maybe it is actually a good analogy. A lot of people actually
consider the wiggly line above the n in Spanish as a separate
character, so Bill would like to see that omitted in Spanish? :-). It is very
difficult to make proper judgments about other people's character sets when one
doesn't speak the language. For instance, all Scandinavian characters
look alike to most outsiders. But actually, there are quite a few
differences depending on whether one speaks Danish, Swedish or
Norwegian.

The C/J/K unification is not the right thing to do when either C, J or
K have severe complaints about it.

	jaap

lee@sq.sq.com (Liam R. E. Quin) (01/30/91)

ath@linkoping.telesoft.se (Anders Thulin) writes:
> Philippe.Deschamp@Nuri.INRIA.Fr writes:
>>   Latin-1 does not cover the French language [...].  It lacks
>>the "oe" ligatures (\oe and \OE of TeX).

>Considering that the OE ligature isn't used in *any* of the 8859/1-8
>tables, I can't help wondering if it really is an important character.

Well, it is used in English in imported words such as [oe]illade (an amorous
look or glance) and [oe]uvre (the works of an artist, painter, etc.).  In the
same way, [ae] is used in Encyclop[ae]dia, Medi[ae]val, [ae]gis, and in names
such as [Ae]lfwin, [Ae]lfric, etc.

Perhaps as these standards mature we'll see them becoming more widely useful.
Or maybe the various inaccessible glyphs will simply not be used, and will
fade away like a snark or a boojum...   :-(


Lee

-- 
Liam R. E. Quin,  lee@sq.com, SoftQuad Inc., Toronto, +1 (416) 963-8337
    ``No question is so difficult to answer as that to which the answer
      is obvious''				 -- George Bernard Shaw

enag@ifi.uio.no (Erik Naggum) (01/31/91)

Philippe,

The oe ligature is precisely a ligature, very much unlike the ae used
in Denmark, Norway, and Iceland (perhaps others).  oe is not a single
character any more than the ligatures, fi, fl, ffi, and ffl are.  oe
does not influence collation order, but is spelled out.  Notice that
it's not an error to write "oeuvre" instead of "<oe>uvre" (where <> is
used to denote a ligature).  It's very much an error to write "aere"
instead of "<ae>re" in Norwegian, not uncommon but still gross ASCII
abuse to the contrary notwithstanding.

I've gleaned this gem from participation in numerous mailing lists,
but I've forgotten precisely which one.  Most probably from the ISO 10646
list on BITNET somewhere.

--
[Erik Naggum]	Snail: Naggum Software / BOX 1570 VIKA / 0118 OSLO / NORWAY
		Mail: <erik@naggum.uu.no>, <enag@ifi.uio.no>
My opinions.	Wail: +47-2-836-863	Another int'l standards dude.

tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill) (01/31/91)

erik@srava.sra.co.jp (Erik M. van der Poel) writes:

> In ISO 10646, it is easy to mix Japanese and Chinese in one sentence.
> Can it be done in Unicode? (I ask, because I don't know.)

Just switch fonts from Kanji to Hanzi.  This is similar to what you
would do if you wanted to print the title of a book-- you'd switch
from roman to italics.  Unicode assumes that small differences between
Han characters (of identical meaning) is a font issue, not a character
coding issue.  Different fonts are favored in Japan and in China, just
as Times is popular in the US and Garamond is popular in France.

> However, Han unification may be quite useful.

Amen!  It's interesting that the Japanese delegation voted against
DIS 10646 because it didn't include Han unification.

Bill

ath@linkoping.telesoft.se (Anders Thulin) (01/31/91)

In article <1991Jan29.200653.23928@sq.sq.com> lee@sq.sq.com (Liam R. E. Quin) writes:
>ath@linkoping.telesoft.se (Anders Thulin) writes:
>>Considering that the OE ligature isn't used in *any* of the 8859/1-8
>>tables, I can't help wondering if it really is an important character.
>
>Well, it is used in English in imported words such as [oe]illade (an amorous
>look or glance) and [oe]uvre (the works of an artist, painter, etc.).  In the
>same way, [ae] is used in Encyclop[ae]dia, Medi[ae]val, [ae]gis, and in names
>such as [Ae]lfwin, [Ae]lfric, etc.

I should have used the word 'indispensable' instead. I doubt it is so
in English - all dictionaries I have consulted use the separate forms
as headwords - the ligatures are occasionally listed as alternative
spellings. 

My problem was with French - a language I don't know. Is the <oe>
ligature really indispensable - I can't help thinking it would have
made its way into the Latin-1/... code tables if it was. Is
`chef-d'<oe>uvre' the only way to spell that word?

-- 
Anders Thulin       ath@linkoping.telesoft.se 
Telesoft Europe AB, Teknikringen 2B, S-583 30 Linkoping, Sweden

amanda@visix.com (Amanda Walker) (02/02/91)

I object to the idea of leaving out <oe> because it is "unnecessary."
It is still useful, especially when representing printed texts
accurately.  In a similar vein, I'd like to see glyphs for "long s,"
ligatures such as "ct," "st," and so on, without having to resort
to private encodings.
--
Amanda Walker
Visix Software Inc.

jeffrey@cs.chalmers.se (Alan Jeffrey) (02/03/91)

In article <723@castor.linkoping.telesoft.se> ath@linkoping.telesoft.se (Anders Thulin) writes:
>I should have used the word 'indispensable' instead. I doubt it is so
>in English - all dictionaries I have consulted use the separate forms
>as headwords - the ligatures are occasionally listed as alternative
>spellings. 

The problem is that even when writing in English, you frequently need
oe and ae ligatures, and even some accented letters.  When?  Precisely
when you are talking about the English language itself.  This
discussion, for example, couldn't be typeset by a system that used
Latin1.  

If Latin1 is proposed as a standard for (for instance) encoding the
textual material of published books, it's going to have to cope with
people (for example historians, or linguists, or lit. critters) who
need to be able to quote texts from more than 30 years ago.  

And as M\'\i che\'al \'O Searc\'oid pointed out at TeX90, Latin1
doesn't cover Irish, which has some accented consonants.

Oh, and my Chambers 20th Century dictionary lists the following words
beginning \ae\ or \oe\ which don't have ae and oe variants:

   \ae sc              (the O.E. letter `ash' now written \ae!)
   \oe il-de-b\oe uf   (a little round window)
   \oe illade          (an ogle)
   
None of these are marked as foreign or obsolete.  Of course this was
eight years ago, things may be different now...

>My problem was with French - a language I don't know. Is the <oe>
>ligature really indispensable - I can't help thinking it would have
>made its way into the Latin-1/... code tables if it was. Is
>`chef-d'<oe>uvre' the only way to spell that word?

Hmm, an interesting idea---`if it was useful it would be in the
standard'.  Ahh, if only ISO worked that way...

Cheers,

Alan.
-- 
Alan Jeffrey         Tel: +46 31 72 10 98         jeffrey@cs.chalmers.se
Department of Computer Sciences, Chalmers University, Gothenburg, Sweden

ath@linkoping.telesoft.se (Anders Thulin) (02/03/91)

In article <1991Feb1.231640.3959@visix.com> amanda@visix.com (Amanda Walker) writes:
>I object to the idea of leaving out <oe> because it is "unnecessary."
>It is still useful, especially when representing printed texts
>accurately.  In a similar vein, I'd like to see glyphs for "long s,"
>ligatures such as "ct," "st," and so on, without having to resort
>to private encodings.

Of course it is useful - I'm not saying it isn't. It's the degree of
usefulness I'm interested in.

Your examples shed additional light on the topic: is the long s
useful to sufficiently many people that it should be placed in an 8-bit
code set?  My reply is no. Similarly with ct, st, ffi, fi, and the
rest. They can equally well be represented by the expanded versions.
Only a very small class of people (textual critics) will be
interested, but I believe they have other and better ways of coping
with problems like these.

So, I am asking again: is the <oe> in French only one of these special
ligatures that convey no extra information, or is it a separate
character that *must* be included if the code table should be of any
use for French texts?
-- 
Anders Thulin       ath@linkoping.telesoft.se 
Telesoft Europe AB, Teknikringen 2B, S-583 30 Linkoping, Sweden

ath@linkoping.telesoft.se (Anders Thulin) (02/03/91)

In article <4363@undis.cs.chalmers.se> jeffrey@cs.chalmers.se (Alan Jeffrey) writes:
>In article <723@castor.linkoping.telesoft.se> ath@linkoping.telesoft.se (Anders Thulin) writes:
>>[...<oe> left out of Latin-1 due to its being dispensable?...]

>If Latin1 is proposed as a standard for (for instance) encoding the
>textual material of published books, it's going to have to cope with
>people (for example historians, or linguists, or lit. critters) who
>need to be able to quote texts from more than 30 years ago.  

True, but of minor relevance. 8859/1 was (as far as I understand it)
intended for interchange of modern languages - obsolete and
obsolescent forms were not included. Is *that* why French <oe> isn't
there? (And is that why the y with dieresis is there?)

>Oh, and my Chambers 20th Century dictionary lists the following words
>beginning \ae\ or \oe\ which don't have ae and oe variants:
>
>   \ae sc              (the O.E. letter `ash' now written \ae!)
>   \oe il-de-b\oe uf   (a little round window)
>   \oe illade          (an ogle)
>
>None of these are marked as foreign or obsolete.  Of course this was
>eight years ago, things may be different now...

I'm almost sure you're joking now... 

I have no quarrel with <ae> - it's already in Latin-1. 

The two last words are obviously French (weren't they marked as such?),
which brings me back to the problem I'm really interested in: is <oe>
an indispensable character/glyph of French?

>>ligature really indispensable - I can't help thinking [ <oe> ] would have
>>made its way into the Latin-1/... code tables if it was. 
>
>Hmm, an interesting idea---`if it was useful it would be in the
>standard'.  Ahh, if only ISO worked that way...

I can't help thinking that the Latin-1 code table as well as the
others must have been developed in close collaboration with the
national standard bodies. Since the French language appears to be
rather closely monitored I would expect very loud complaints from the
French national standards organizations if characters that were of
vital importance to the modern language weren't included in *any* of the
Latin-n code tables, particularly the one that is claimed to cover the
most important Western languages.

(Or don't they care? Perhaps there's a FRASCII which makes more sense
to use?)
-- 
Anders Thulin       ath@linkoping.telesoft.se 
Telesoft Europe AB, Teknikringen 2B, S-583 30 Linkoping, Sweden

enag@ifi.uio.no (Erik Naggum) (02/04/91)

Amanda,

Most of these weird ligatures are covered in ISO DIS 10646.  I didn't
find the <ct> ligature in my last scan (got the latest draft in the
mail from Mike Ksar of Hewlett Packard only yesterday).

For those who might be scared of ISO DIS 10646: The default encoding,
with the one-octet compaction method, is equivalent to ISO 8859-1.
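
The same relationship is visible in the character set as eventually
standardized: the first 256 code points coincide with ISO 8859-1.  A
small Python check is sketched below.  It relies on Python's "latin-1"
codec and says nothing about the 1991 draft itself.  The function name
is illustrative.

```python
def latin1_octets_match_code_points() -> bool:
    """Decode all 256 octets as Latin-1 and compare with code points."""
    decoded = bytes(range(256)).decode("latin-1")
    return [ord(ch) for ch in decoded] == list(range(256))
```

In other words, a one-octet Latin-1 file can be widened to larger code
units by zero-extending each byte, with no lookup table.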

--
[Erik Naggum]					     <enag@ifi.uio.no>
Naggum Software, Oslo, Norway			   <erik@naggum.uu.no>

amanda@visix.com (Amanda Walker) (02/05/91)

In article <ENAG.91Feb4003441@holmenkollen.ifi.uio.no>
enag@ifi.uio.no (Erik Naggum) writes:
>Most of these weird ligatures are covered in ISO DIS 10646.

Granted, and I definitely think that 10646 (or something similar,
like Unicode) is the only effective approach for many applications.
It certainly looks like an improvement over ISO IS 2022 :)...

Don't get me wrong--I like the ISO 8859 code sets, as far as they
go.  What I was objecting to was the idea that "archaic" characters
weren't useful to represent in electronic form.  I admit that I
may have overreacted; I just have strong opinions about the matter.

-- 
Amanda Walker						      amanda@visix.com
Visix Software Inc.					...!uunet!visix!amanda
--
Courage is the willingness of a person to stand up for his beliefs in the face
of great odds. Chutzpah is doing the same thing wearing a Mickey Mouse hat.

jeffrey@cs.chalmers.se (Alan Jeffrey) (02/05/91)

In article <ENAG.91Feb4020953@holmenkollen.ifi.uio.no> enag@ifi.uio.no (Erik Naggum) writes:

>Let me take issue with the bizarre idea that ISO 8859-1 should be
>sufficient for typesetting.  

This wasn't what I was arguing---certainly Latin1 shouldn't cover
ligatures, font changes, etc. which are too dependent on typographical
decisions, but it *should* be capable of encoding plain text.  Tables,
mathematics, fonts, blah etc. need special encoding, but any serious
character encoding scheme should be able to handle, say, a dictionary
(minus the pronunciation guide).

In an idealish world, Latin1 would cover the plain text characters
from the Western European languages, plus the standard
typewriter/programming symbols.  Unfortunately there are more than 191
of these, so something's got to give, and it's probably the claim that
Latin1 covers all Western Europe.  Unfortunately, whichever standard
covers the US will probably end up becoming the de facto standard for
international transmission, so we'll end up no better off than today.
Sighh...
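
The figure of 191 and the gap under discussion can both be reproduced.
A small sketch, assuming Python 3 and its "latin-1" codec as the
definition of ISO 8859-1; the helper name in_latin1 is made up for the
illustration. It counts the graphic positions of the code table and then
asks which of the disputed letters fit.

```python
# Graphic (printing) positions of ISO 8859-1: 0x20-0x7E and 0xA0-0xFF.
graphic = list(range(0x20, 0x7F)) + list(range(0xA0, 0x100))
print(len(graphic))  # 95 + 96 positions

def in_latin1(ch):
    """Return True if the character has a position in ISO 8859-1."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

# ae-ligature, y with dieresis, oe-ligature (lower and upper case).
for ch in ["\u00e6", "\u00ff", "\u0153", "\u0152"]:
    print(repr(ch), in_latin1(ch))
```

The ae-ligature and y with dieresis fit; neither case of the oe-ligature
does.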

Alan.
-- 
Alan Jeffrey         Tel: +46 31 72 10 98         jeffrey@cs.chalmers.se
Department of Computer Sciences, Chalmers University, Gothenburg, Sweden

jeffrey@cs.chalmers.se (Alan Jeffrey) (02/05/91)

In article <727@castor.linkoping.telesoft.se>:
>True, but of minor relevance. 8859/1 was (as far as I understand it)
>intended for interchange of modern languages - obsolete and
>obsolescent forms were not included. Is *that* why French <oe> isn't
>there? (And is that why the y with dieresis is there?)

The question is---is Latin1 intended for modern languages, or modern
documents?  Documents written today still need to be able to refer to
obsolete usages.  The problem is where you draw the line to say `this
character is strange enough that its usage is completely dead', and
I'm not convinced that \oe\ is that old.  I even saw it on a road
sign in the UK a few months ago.

[About various words beginning `\oe' in Chambers.]

>I'm almost sure you're joking now... 

Well, yes, it wasn't meant particularly seriously, but it does mean
that eight years ago there were still words in English that Chambers
reckoned couldn't be written without \oe.  \OE illade is a bit of a
weirdo, but \oe il-de-b\oe f is a common enough architectural term
that I've heard of it.  And like I said, neither of these are marked
foreign or obsolete.

Oh, as an aside, if I type `\oe' as `oe' in my plain text, how will I
cope with the fact that it should be capitalised as `OEillade' if the
oe is a ligature, and `Oeillade' otherwise? 
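
The capitalisation problem is easy to demonstrate once the ligature has
a code of its own. A sketch, assuming Python 3 and a character set in
which the oe-ligature is a single character (U+0153 in Unicode, which
Latin-1 lacks): the same capitalisation routine gives different answers
for the two spellings, and the plain `oe' spelling has no way to ask for
the first one.

```python
ligature = "\u0153illade"  # spelled with the oe-ligature as one character
digraph = "oeillade"       # spelled with two separate letters

# The ligature capitalises as a unit; the digraph only raises the "o".
print(ligature.capitalize())
print(digraph.capitalize())
```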

Cheers,

Alan.
-- 
Alan Jeffrey         Tel: +46 31 72 10 98         jeffrey@cs.chalmers.se
Department of Computer Sciences, Chalmers University, Gothenburg, Sweden

erik@srava.sra.co.jp (Erik M. van der Poel) (02/05/91)

> > Amen!  It's interesting that the Japanese delegation voted against
> > DIS 10646 because it didn't include Han unification.
> > 
> > Bill Tuthill
> 
> I understand that Japan is also adamantly against Unicode.
> 
> Mark Leisher

Mark, don't believe everything you see on the net (including what I'm
writing now :-).

Bill, could you re-check the accuracy of your article, and, if it is
accurate, please give us the names of the Japanese delegates involved,
or the name of their working group, and any other relevant info such
as document numbers and/or voting date/place.


> Does
> anyone have any current info on anything that is being proposed in
> lieu of ISO/IEC DIS 10646 and Unicode?

Some Japanese have set up an informal group to discuss the possible
extension of Japanese EUC and Shift-JIS to support the new
supplementary Kanji set called JIS X 0212. Some of these proposals
mushroomed into "internationalization" extensions. I may be able to
provide more info if anyone is interested.

(This is not to say that "the Japanese" are against both 10646 and
Unicode.)
-- 
Erik M. van der Poel                                      erik@sra.co.jp
Software Research Associates, Inc., Tokyo, Japan     TEL +81-3-3234-2692

lee@sq.sq.com (Liam R. E. Quin) (02/06/91)

amanda@visix.com (Amanda Walker) writes:
>Don't get me wrong--I like the ISO 8859 code sets, as far as they
>go.  What I was objecting to was the idea that "archaic" characters
>weren't useful to represent in electronic form.  I admit that I
>may have overreacted; I just have strong opinions about the matter.

Of course, ligatures like [oe] and [ae] are still considered `correct' in
oeuvre, mediaeval, encyclopaedia, Oedipus, Aethelwine, etc., in the UK --
`archaic' is relative.

In other words, I agree with you strongly!

I think also that the distinction between glyph-name (ae-ligature), glyph
and position in collation sequence must be made clear, especially as
collating sequence varies from nationality to nationality.  Once we get so
far advanced that we can conceive of printing a Welsh dictionary, we'd better
be able to sort the entries correctly :-)

Some of the work on fonts from ISO 9541 might be profitable reading here.

And yes, I'd love a standard position for tall-s, yogh, etc., but the ct
ligature should be inserted automatically in the same way that the ff
ligature is made at the moment in electronic systems.
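
Automatic insertion of this kind can be sketched as a longest-match
substitution over a ligature table. This is only an illustration in
Python 3, not how any particular typesetting system does it: the names
LIGATURES and apply_ligatures are invented, and the table uses the
Unicode presentation forms for the f-ligatures. A ct entry would have to
point at whatever glyph the font supplies, so none is assumed here.

```python
# Unicode presentation forms for the f-ligatures (U+FB00 to U+FB04).
LIGATURES = {
    "ffi": "\ufb03",
    "ffl": "\ufb04",
    "ff": "\ufb00",
    "fi": "\ufb01",
    "fl": "\ufb02",
}

def apply_ligatures(text, table=LIGATURES):
    """Replace letter sequences with ligatures, longest match first."""
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No ligature starts here; copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)

print(apply_ligatures("office"))
print(apply_ligatures("shuffle"))
```

Trying the three-letter entries first matters: otherwise "office" would
get the ff ligature followed by a loose i instead of the ffi ligature.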

Lee

-- 
Liam R. E. Quin,  lee@sq.com, SoftQuad Inc., Toronto, +1 (416) 963-8337

yfcw14@castle.ed.ac.uk (K P Donnelly) (02/07/91)

em@dce.ie (Eamonn McManus) writes:

>No, modern Irish does not have any accented consonants.  It does require
>the ability to put acute accents on all five vowels, though.  Older Irish
>writing used a dot above a consonant to indicate lenition, which is now
>written as a h after the letter.  But this writing uses a special script
>which is not in Latin1 anyhow.

Agreed, except for the last sentence.  It was actually just a special
font, like the Fraktur formerly used with German.  The character set
standards don't cover fonts.

   Kevin Donnelly

jeffrey@cs.chalmers.se (Alan Jeffrey) (02/07/91)

In article <1991Feb5.174923.16236@sq.sq.com> lee@sq.sq.com (Liam R. E. Quin) writes:
>amanda@visix.com (Amanda Walker) writes:

>I think also that the distinction between glyph-name (ae-ligature), glyph
>and position in collation sequence must be made clear, especially as
>collating sequence varies from nationality to nationality.  Once we get so
>far advanced that we can conceive of printing a Welsh dictionary, we'd better
>be able to sort the entries correctly :-)

Agreed totally---one of the best tests for `is this glyph a separate
letter or just a decorated form of another letter?' is whether it
alphabetizes and/or capitalizes differently.  So French \'a can be
regarded as an a with an accent, but Swedish \"a can't, as it
alphabetizes to the back of the dictionary.  On this basis I'd claim
`\oe' as a separate letter, as it capitalizes to `\OE' whereas `oe'
capitalizes to `Oe'.

In general, collation is much more difficult than it appears---not
only does it vary from language to language (is \"o before or after
p?) but also from application to application (is `McCarthy' before or
after `May'?).  And you should try convincing BibTeX that some
people's surnames come before their given names...
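
The language dependence can be shown with two hand-built sort keys. A
rough sketch in Python 3, avoiding the system locale so that it runs
anywhere; the names SWEDISH_ALPHABET, swedish_key and folded_key are
invented, and the keys only handle the lowercase letters listed. One key
treats the accented letters as separate letters at the end of the
alphabet, the other folds them onto their base letters.

```python
# Swedish order: a-z, then a-ring, a-dieresis, o-dieresis as separate letters.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

def swedish_key(word):
    """Sort key that ranks each letter by its place in the Swedish alphabet."""
    return [SWEDISH_ALPHABET.index(ch) for ch in word.lower()]

def folded_key(word):
    """Sort key that treats the accented letters as decorated a and o."""
    table = str.maketrans("\u00e5\u00e4\u00f6", "aao")
    return word.lower().translate(table)

words = ["zebra", "\u00e4pple", "apa"]
print(sorted(words, key=swedish_key))
print(sorted(words, key=folded_key))
```

With the first key the word beginning with a-dieresis goes to the back
of the list; with the second it sorts among the a's.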

Cheers,

Alan.
-- 
Alan Jeffrey         Tel: +46 31 72 10 98         jeffrey@cs.chalmers.se
Department of Computer Sciences, Chalmers University, Gothenburg, Sweden

lee@sq.sq.com (Liam R. E. Quin) (02/12/91)

lee@sq.sq.com (Liam R. E. Quin) writes:

> I think also that the distinction between glyph-name (ae-ligature), glyph
> and position in collation sequence must be made clear, especially as
> collating sequence varies from nationality to nationality.  [...]

jeffrey@cs.chalmers.se (Alan Jeffrey) writes:

>Agreed totally---one of the best tests for `is this glyph a separate
>letter or just a decorated form of another letter?' is whether it
>alphabetizes and/or capitalizes differently.

Another important point is that if one were (for example) to use "oe"
instead of the oe-ligature in French, one could no longer set those words
which contain o and e as distinct glyphs, such as `coexistence'.

Perhaps typesetting systems could have a ligature-exception table, which
would prevent such errors -- it's not clear to me.
I don't know of any French words which change in meaning if the oe-ligature
is replaced by "oe", but then I don't know French.  Examples of Dutch IJ
capitalisation (which are correct in all of the atlases I own) have been
provided recently on the net, of course.
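
The ligature-exception table suggested above might look something like
this. A sketch only, in Python 3, with invented names (EXCEPTIONS,
ligate_oe) and a one-entry table that is purely illustrative; a usable
table would need to be far larger, and it is an open question whether
listing exceptions or listing the ligatured words is the shorter list.

```python
# Hypothetical exception table: words in which o and e stay separate.
EXCEPTIONS = {"coexistence"}

def ligate_oe(word, exceptions=EXCEPTIONS):
    """Replace oe with the oe-ligature unless the word is an exception."""
    if word.lower() in exceptions:
        return word
    return (word.replace("oe", "\u0153")
                .replace("Oe", "\u0152")
                .replace("OE", "\u0152"))

print(ligate_oe("oeuvre"))
print(ligate_oe("Oedipus"))
print(ligate_oe("coexistence"))
```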

Oedipus and Aelfwine don't look at all right to me...!

Lee

-- 
Liam R. E. Quin,  lee@sq.com, SoftQuad Inc., Toronto, +1 (416) 963-8337