koontz@cam.nist.gov (John E. Koontz X5180) (01/15/91)
Since I cross posted some comp.text contributions to the Polyglot list which inspired replies, I am posting the replies and some supporting material here for the benefit of the original posters.

Date: Fri, 11 Jan 91 13:57:31 -0600
To: Polyglot@tira.uchicago.edu
From: Polyglot-request@tira.uchicago.edu
Subject: Polyglot Digest V2 #2
--------
__________________________ P O L Y G L O T _________________________

   POLYGLOT -- A Mailing List Devoted to Multilingual Computing
        The Center for Information and Language Studies

   Contributions to:           polyglot@tira.uchicago.edu
   Administrative requests to: polyglot-request@tira.uchicago.edu
   Anonymous ftp archive:      tira.uchicago.edu:polyglot
____________________________________________________________________

Polyglot Digest         Friday, 11 Jan 1991       Volume 2 : Issue 2

Today's Topics:
    Administrivia
    Unicode Progress Report
    International character set requirements needed
    GNU Emacs and 8-bit text
    smtp interest
    8-bit cleaning, Unicode, etc.

------------------------------------------------------------

Date: Fri, 11 Jan 91 13:45:01 -0600
From: scott@sage.uchicago.edu (Scott Deerwester)
Subject: Administrivia

Well! It appears that the only problem with Polyglot was that most people forgot about it! The distribution of issue 2/1 prompted a number of responses to the international character set requirements discussion, which form the bulk of issue 2. Two announcements complete the issue. First, John Koontz forwards a Unicode progress report from Asmus Freytag. The final article is an announcement about work that der Mouse is doing on GNU emacs and 8-bit characters.

Enjoy! And *please* contribute! Submissions from various news groups are welcome.

Scott Deerwester
Center for Information and Language Studies
University of Chicago
...

------------------------------

Date: Thu, 10 Jan 91 11:53:37 -0800
From: Tom McFarland <tommc@hpcvlx.cv.hp.com>
Subject: International character set requirements needed

Scott,

You might want to put out a correction to V2 #1.
Both Unicode and ISO 10646 are useful encodings... however, they are not similar or closely related, as various posters in V2#1 indicate.

Tom McFarland
Hewlett-Packard, Co.
Interface Technology Operation
Internationalization Team
<tommc@cv.hp.com>

>From: tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill)
>Newsgroups: comp.text
>
>keld@login.dkuug.dk (Keld J|rn Simonsen) writes:
>
>This is what Unicode is for. Unicode should be considered the most
>useful and implementable subset of the draft standard ISO 10646.

Unicode can by no stretch of the imagination be considered a subset (either proper or improper) of ISO 10646. While both attempt to address the same objective, their similarities end there. ISO 10646 is a standard being developed by official national representatives; Unicode is a grass-roots, competing code set being proposed by a group of vendors.

> ... The reason 16 bits are enough is that Asian pictographs which
>everyone would recognize as the same have been unified. Thus, more
>than 31,000 characters have been reduced to about 20,000 slots.

Not everyone recognizes this. In fact, enough people disagree that exactly this proposed change was voted down in the ISO group drafting 10646. As I remember it, Japan was the major opponent to this modification.

>------------------------------
>
>Date: Fri, 04 Jan 91 09:44:29 -0700
>From: koontz@alpha.bldr.nist.gov (John E. Koontz)
>Subject: International character set requirements needed
>
>Forwarded message follows:
>
>From: tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill)
>Newsgroups: comp.text
>
>keld@login.dkuug.dk (Keld J|rn Simonsen) writes:
>>
>> Is UNICODE a true subset of ISO 10646?
>> Is there a well defined relation between ISO 10646 encoding and UNICODE?
>
>ISO 10646 is still in draft form. Both questions are impossible to
>answer until 10646 gets finalized.

The draft is fairly stable, and the questions are not that difficult to answer.
UNICODE is not a true subset of ISO 10646; the two encoding methods are similar only in their attempt to address the same problem set. As for there being a well-defined relation between 10646 and Unicode... the author answers his own question in the trailing paragraphs: there is not a one-to-one mapping, and data may be lost converting between the two.

>Disclaimer: I'm not an expert in this area.
>However, extrapolating from what I know, it appears that Unicode
>could be considered a 16-bit implementation of 10646. The ISO 10646
>draft standard appears to permit 16-bit implementations of any subset
>thereof, for use in process code or communication.

ISO 10646 is very specific in the forms of use allowed. One key difference that comes to mind is that ISO prohibits assigning characters to row/column/plane/group values in the ranges 0x00-0x20, 0x7F-0xA0, and 0xFF. ISO has done this in an attempt to maintain some level of backwards compatibility with hardware/software that recognizes these values as control codes. Unicode actively uses these values to achieve its compactness.

>It just so happens that Unicode covers all Asian characters
>enumerated by existing national standards, plus characters from
>languages that the 10646 draft hasn't even thought about. So it may
>be a subset, but a largely complete subset.

>There have been attempts to convert Unicode to 10646 and back again,
>I believe with mostly good results. Of course, some data may be lost
>in the translation.

------------------------------

Date: Thu, 10 Jan 91 21:04:56 -0500
From: der Mouse <mouse@lightning.McRCIM.McGill.EDU>
Subject: GNU Emacs and 8-bit text

I've been sort of wondering about a good place to mention this, and today's polyglot digest reminded me of its existence :-)

I have extended the display support in GNU emacs 18.55.95 to support display of 8-bit text.
(I have offered my changes to Stallman, but he tells me that version 19 already addresses the problem, and he'd rather work on getting 19 out than on updating 18.*.)

The changes can actually be used for other things as well, as you'll see from the description below....

The changes eliminate the ctl-arrow variable and create two new functions:

set-chardisp
	Set the way a character displays in the current buffer (or set the default). The first argument is the character whose display is to be set, or nil; the second is the string it is to display as, or nil. (Each character of this string is assumed to occupy one screen position.) If the third argument is omitted or is nil, the current buffer's display is set; if it's a buffer, that buffer's display is set; otherwise, the default display is set. If the character is nil, all 256 entries of the table are set; if the string is nil, the display is set to the default (for a buffer, it uses the default value; for the default, the built-in default display is restored). Passing nil as both of the first two arguments works sensibly.

get-chardisp
	Get the way a character displays in the current buffer (or the default). The first argument is the character whose display string is to be returned, or nil. If the second argument is omitted or is nil, the current buffer's display is returned; if it's a buffer, that buffer's display is returned; otherwise, the default display string (used for buffers that haven't specifically set a string, or for contexts where no buffer is readily available) is returned. If the character is non-nil, that character's display string is returned; if not, a 256-element vector is returned, listing all the display strings for the buffer (or default) requested. The returned value is always a copy; modifying it will not affect the display. Use set-chardisp to change the display.

and two new variables

default-special-tab-display
	Default special-tab-display for buffers that do not override it.
	This is the same as (default-value 'special-tab-display).

special-tab-display
	Display tabs by moving to tab stops (as opposed to displaying as control-I). Non-nil means to display by tabbing; nil means to display tabs as if they were any other control character. Automatically becomes local when set in any fashion.

(The idea is that if your display device can display 8-bit text directly, you use set-chardisp to set each of the high-half characters to display as itself (ie, as a one-character string); if not, you can do things like making e-acute display as <e'> and o-slash as <o/>.)

The diffs are under 24K. I have not yet gotten around to doing the corresponding things to the input code. I can mail the diffs, put them up for anonymous ftp, or even post them somewhere. Let me know what you think should be done (unless, of course, you don't care at all :-). I also have not written any lisp code to use the new primitives.

					der Mouse

			old: mcgill-vision!mouse
			new: mouse@larry.mcrcim.mcgill.edu

------------------------------

Date: Fri, 11 Jan 91 10:07:12 +0000
From: Glenn.Wright@UK.Sun.COM (Glenn Wright - Sun EHQ - Mktg)
Subject: smtp interest

Keld,

I noted your posting to polyglot, re: sendmail issues. Were you aware that the IETF (Internet Engineering Task Force) is currently studying the means by which non-ASCII mail can be sent using the SMTP protocol? I believe they are working on an extended SMTP form (ESMTP). I wonder if this will be successful?

Glenn Wright, Sun Microsystems.

Here is some information on the group:

IETF
------

The goals of the group are the following:

 o Incorporate compatibility with the new host requirements document.
 o Allow binary data in message bodies and remove line length restrictions.
 o Allow command pipelining (batched smtp).
 o Enhance the maintainability/management of mail systems.
 o Draft a management information base for use by network management systems.
 o And perhaps expand the header alphabet somewhat.
Things the group does not intend to do:

 o Attempt to mimic the functionality of X.400.
 o Produce a major re-write of the rfc821/822 mail format.
 o Make changes to the header structure.

Our strategy is to develop an update of rfc1154 (content-type header) to better meet the needs of having multiple character sets and encodings. Basically we want to separate out the notion of content type versus the encoding of that content type. This should allow gateways between binary and non-binary capable mail systems to make intelligent choices about encoding data. We'll also most likely formalize the content-length header field. We would then encourage people to publish documents (rfcs) describing the data types and encodings which they wish to use.

Members of the group will be working on formalization of the Text-Hex encoding scheme. This encoding scheme allows for the representation of 8-bit characters as an ASCII escape sequence. This should allow a variety of additional character types in mail headers without the need for changing the header specifications. This encoding could also be used on the bodies of messages that are "mostly" 7-bit character sets. Several European languages fall into this area.

------------------------------

Date: Fri, 11 Jan 91 17:26:45 +0100
From: macrakis@gr.osf.org
Subject: 8-bit cleaning, Unicode, etc.

A few comments on the international character set discussion:

1) Converting existing 7-bit programs or protocols to be 8-bit clean is almost always `trivial', in some sense, but it does require a non-negligible amount of work and even thought. If one subroutine uses the top bit to mark some property or uses a particular character as a sentinel, it has to be first identified and then fixed. And then some other way has to be found to represent the character properties or string length--this may have secondary effects in many places. I'm thinking of both the troff and the sendmail/SMTP discussion.
2) troff, the Unix variant of the CTSS runoff program (1963!!), is ancient technology, and making it 8-bit clean strikes me as a rear-guard action.

3) The OSF/1 system is 8-bit clean (it may even have a clean troff).

4) GNU Emacs, unlike vi (which was a notorious user of the 8th bit, in fact), has <<always>> been 8-bit clean. In fact, you can use it to edit binary files! On the other hand, only the latest versions allow you to type in and display 8-bit graphic characters.

5) There does NOT appear to be a clean, standard, and reasonable way to specify which character set you're using in a given file, nor any way to switch character sets within a file.

6) Latin-1 does indeed cover all of Western Europe, but does <<not>> cover Greek, and therefore does not cover all the EEC.

7) The ISO 646 alternate national characters are handy for unilingual environments, but are a disaster for multilingual environments.

8) Unicode seems very nice. Characters are a fixed 16 bits, which greatly simplifies processing. However, there is the notion of diacritical marks (accents, vowel points, etc.), any number of which may follow a base character. Messier to handle (I do not have the full spec, so I don't know how it's done) are the several double diacritics (modifying two characters at a time). Of course, most programs do not care.

9) I have not seen ISO 10646, but it seems crazy to go from a fixed-width 16-bit character set to a variable-width character set just to represent the same Chinese character multiple times.

------------------------------

End of Polyglot Digest
**********************
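[Archiver's note: the Text-Hex scheme mentioned in the smtp item of the digest above is only described in outline. As a rough illustration of the idea — flattening an 8-bit byte stream into printable ASCII escapes — here is a sketch in Python; the `=XX` escape syntax is invented for the example and is not the actual Text-Hex format.]

```python
def encode_8bit(data: bytes) -> str:
    """Escape non-ASCII bytes (and the escape character itself) as =XX,
    in the spirit of the Text-Hex idea: 8-bit data survives a 7-bit
    mail channel. The '=XX' syntax is a guess, not the real spec."""
    out = []
    for b in data:
        if 0x20 <= b <= 0x7E and b != ord('='):
            out.append(chr(b))          # printable ASCII passes through
        else:
            out.append('=%02X' % b)     # everything else becomes an escape
    return ''.join(out)

def decode_8bit(text: str) -> bytes:
    """Invert encode_8bit."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == '=':
            out.append(int(text[i + 1:i + 3], 16))  # two hex digits follow
            i += 3
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)
```

For example, a Latin-1 byte string such as `b'caf\xe9'` encodes to `caf=E9` and decodes back unchanged — the "mostly 7-bit" European-language case the post describes.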
gisle@ifi.uio.no (Gisle Hannemyr) (01/15/91)
> Date: Fri, 11 Jan 91 17:26:45 +0100
> From: macrakis@gr.osf.org
> Subject: 8-bit cleaning, Unicode, etc.
>
> 6) Latin-1 does indeed cover all of Western Europe, but does <<not>>
> cover Greek, and therefore does not cover all the EEC.

Latin-1 (ISO 8859/1) does NOT cover Lappish, which is a minority language used by the Lapps in Norway, Sweden and Finland. As the Nordic countries are usually considered part of Western Europe, Latin-1 does not cover all of Western Europe.

Lappish is covered by Latin-4 (ISO 8859/4).

-- 
Disclaimer: The opinions expressed herein are not necessarily those of my employer, not necessarily mine, and probably not necessary.

- gisle hannemyr  (Norwegian Computing Center)
  EAN:  C=no;PRMD=uninett;O=nr;S=Hannemyr;G=Gisle  (X.400 SA format)
        gisle.hannemyr@nr.no                       (RFC-822 format)
  Inet: gisle@ifi.uio.no
  UUCP: ...!mcsun!ifi!gisle
------------------------------------------------
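[Archiver's note: the coverage claim above is easy to check with modern codecs. The letter eng (ŋ), used in Sámi (Lappish) orthography, has no Latin-1 code point but does have one in Latin-4; the Python codec names below are modern, used here purely for illustration.]

```python
eng = '\u014b'  # LATIN SMALL LETTER ENG, used in Sami (Lappish) orthography

def representable(ch: str, charset: str) -> bool:
    """True if the character has a code point in the given charset."""
    try:
        ch.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

# Latin-1 has no slot for eng; Latin-4 places it at 0xBF.
print(representable(eng, 'iso8859-1'))   # False
print(representable(eng, 'iso8859-4'))   # True
```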
yfcw14@castle.ed.ac.uk (K P Donnelly) (01/15/91)
>> 6) Latin-1 does indeed cover all of Western Europe, but does <<not>>
>> cover Greek, and therefore does not cover all the EEC.

>Latin-1 (ISO 8859/1) does NOT cover "lappish", which is a minority language
>used by the Lapps in Norway, Sweden and Finland. As the Nordic countries
>are usually considered part of Western Europe, Latin-1 does not cover all
>of Western Europe.
>Lappish is covered by Latin-4 (ISO 8859/4).

Latin-1 does not cover Welsh, which has something like 300 or 400 thousand speakers. Wales is both in the EEC and in Western Europe. In fact, Welsh is not covered by *any* of the parts of ISO 8859. The problem is that Welsh has accents not only on the vowels a e i o u but also on the semivowels w and y, and some of them are important.

Welsh is included in the mechanism of ISO 6937, which is based on Teletext, is allowed in X.400 headers, and uses non-spacing "floating" accents to code accented characters in two bytes (often making processing difficult). However, a proposal to amend the ISO 6937 *repertoire* to include Welsh was recently voted down.

   Kevin Donnelly
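[Archiver's note: the gap Donnelly describes is easy to demonstrate. Welsh w-circumflex (ŵ), needed for words like dŵr ("water"), has no code point in any of the 8859 parts in existence at the time; the check below sweeps parts 1 through 9 using modern Python codec names, for illustration only.]

```python
w_circumflex = '\u0175'  # LATIN SMALL LETTER W WITH CIRCUMFLEX

def covered_by(ch: str, charsets) -> list:
    """Return the subset of charsets in which the character has a code point."""
    hits = []
    for cs in charsets:
        try:
            ch.encode(cs)
            hits.append(cs)
        except UnicodeEncodeError:
            pass
    return hits

parts = ['iso8859-%d' % n for n in range(1, 10)]
print(covered_by(w_circumflex, parts))   # [] -- no part of 8859/1-9 has it
```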
Philippe.Deschamp@Nuri.INRIA.Fr (01/18/91)
In article <6600@alpha.cam.nist.gov>, koontz@cam.nist.gov (John E. Koontz X5180) writes:
|> From: macrakis@gr.osf.org
|> Subject: 8-bit cleaning, Unicode, etc.
...
|> 6) Latin-1 does indeed cover all of Western Europe, but does <<not>>
|> cover Greek, and therefore does not cover all the EEC.

In article <GISLE.91Jan14232159@kyrre.uio.no>, gisle@ifi.uio.no (Gisle Hannemyr) replies:
|> Latin-1 (ISO 8859/1) does NOT cover "lappish", which is a minority language
|> used by the Lapps in Norway, Sweden and Finland. As the Nordic countries
|> are usually considered part of Western Europe, Latin-1 does not cover all
|> of Western Europe.

In article <7828@castle.ed.ac.uk>, yfcw14@castle.ed.ac.uk (K P Donnelly) adds:
|> Latin-1 does not cover Welsh, which has something like 300 thousand or
|> 400 thousand speakers. Wales is both in the EEC and in Western Europe.

Latin-1 does not cover the French language, which is used in France and other countries (maybe mainly in other countries :-). It lacks the "oe" ligatures (\oe and \OE of TeX). Last time I looked at a map, France was in Western Europe...

Philippe Deschamp.
Tlx: 697033F   Fax: +33 (1) 39-63-53-30   Tel: +33 (1) 39-63-58-58
Email: Philippe.Deschamp@Nuri.INRIA.Fr || ...!inria!deschamp
Smail: INRIA, Rocquencourt, BP 105, 78153 Le Chesnay Cedex, France
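[Archiver's note: Deschamp's complaint is easily verified — œ has no Latin-1 code point, so the only lossless fallback within that charset is to spell the ligature out. A small Python illustration, with modern codec names:]

```python
word = 'c\u0153ur'   # French "coeur" (heart) written with the oe ligature

try:
    word.encode('iso8859-1')
    in_latin1 = True
except UnicodeEncodeError:
    in_latin1 = False   # Latin-1 has no slot for the ligature

print(in_latin1)                        # False
print(word.replace('\u0153', 'oe'))     # the usual workaround: spell it out
```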
tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill) (01/19/91)
koontz@cam.nist.gov (John E. Koontz X5180) forwards POLYGLOT: (Please forward this to POLYGLOT.)

> ------------------------------
> From: Tom McFarland <tommc@hpcvlx.cv.hp.com>
>
> ISO 10646 is a standard being developed by official national representatives;
> Unicode is a grass roots based, competing code set being proposed by a
> group of vendors.

Proposals are currently underway to make Unicode one of the accepted code planes in 10646, and to create a 10646U (U for Unicode) compaction form using only 16 bits, rather than the 32 bits required by 10646. The Unicode Consortium has no wish to compete with ISO 10646, and would prefer to work with ISO toward a truly useful standard.

> there is not a one-to-one mapping and data may be lost
> converting between the two.

Mainly for the reason that Unicode includes many languages that 10646 does not represent. Both Unicode and 10646 fully represent all of ISO 8859 and all existing Chinese/Japanese/Korean national standards.

I believe that C/J/K unification is the right thing to do. Consider what the world would be like if English-speaking people insisted on having their own A-Za-z alphabet, separate from Spanish A-Za-z. This is exactly what East Asian countries are doing.

> ISO 10646 is very specific in the forms of use allowed. One key
> difference that comes to mind is that ISO prohibits assigning
> characters to row/column/plane/group values in the ranges 0x00-0x20,
> 0x7f-0xA0, and 0xff. ISO has done this in an attempt to maintain some
> level of backwards compatibility with hardware/software that recognize
> these values as control codes. Unicode actively uses these values to
> achieve its compactness.

Unicode does leave empty slots for ASCII and ISO 8859 control codes. Isn't that sufficient? I don't understand the purpose of leaving any more empty slots than those. Perhaps someone knowledgeable about ISO 10646 could enlighten me.
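[Archiver's note: the restriction quoted above excludes the control-code ranges from every octet of a DIS 10646 code position. A hypothetical checker — range boundaries taken straight from the post, so the actual draft text may differ — makes the cost in usable code space concrete:]

```python
def octet_usable(value: int) -> bool:
    """Per the rule quoted in the post, DIS 10646 forbids code positions
    whose group, plane, row, or cell octet falls in a control-code range."""
    return not (0x00 <= value <= 0x20 or
                0x7F <= value <= 0xA0 or
                value == 0xFF)

usable = sum(octet_usable(v) for v in range(256))
print(usable)          # 188 usable values per octet, out of 256
print(usable ** 2)     # 35344 cells per 256x256 plane instead of 65536
```

This is the compactness trade-off Tuthill and McFarland are arguing about: Unicode reclaims those 68 values per octet, at the price of the control-code compatibility ISO wanted to keep.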
ath@linkoping.telesoft.se (Anders Thulin) (01/27/91)
In article <1840@seti.inria.fr> Philippe.Deschamp@Nuri.INRIA.Fr writes:
> Latin-1 does not cover the french language, which is used in France
>and other countries (maybe mainly in other countries :-). It lacks
>the "oe" ligatures (\oe and \OE of TeX). Last time I looked at a map,
>France was in Western Europe...

Considering that the OE ligature isn't used in *any* of the 8859/1-8 tables, I can't help wondering if it really is an important character.

Perhaps it's a plot to force France out of Western Europe :-)
-- 
Anders Thulin       ath@linkoping.telesoft.se
Telesoft Europe AB, Teknikringen 2B, S-583 30 Linkoping, Sweden
jaap@mtxinu.COM (Jaap Akkerhuis) (01/29/91)
In article <4947@srava.sra.co.jp> erik@srava.sra.co.jp (Erik M. van der Poel) writes:
> Bill Tuthill writes:
> > I believe that C/J/K unification is the right thing to do. Consider
> > what the world would be like if English-speaking people insisted on
> > having their own A-Za-z alphabet, separate from Spanish A-Za-z.
>
> I don't think that is a very good analogy. [stuff deleted]

Maybe it is actually a good analogy. A lot of people consider the wiggly line above the n in Spanish to be a separate character — so would Bill like to see that omitted in Spanish? :-)

It is very difficult to make proper judgments about other people's character sets when one doesn't speak the language. For instance, all Scandinavian characters look alike to most outsiders. But actually, there are quite a few differences depending on whether one speaks Danish, Swedish or Norwegian.

The C/J/K unification is not the right thing to do when either C, J or K has severe complaints about it.

	jaap
lee@sq.sq.com (Liam R. E. Quin) (01/30/91)
ath@linkoping.telesoft.se (Anders Thulin) writes:
> Philippe.Deschamp@Nuri.INRIA.Fr writes:
>> Latin-1 does not cover the french language [...]. It lacks
>>the "oe" ligatures (\oe and \OE of TeX).
>Considering that the OE ligature isn't used in *any* of the 8859/1-8
>tables, I can't help wondering if it really is an important character.

Well, it is used in English in imported words such as [oe]illade (an amorous look or glance) and [oe]uvre (the works of an artist, painter, etc.). In the same way, [ae] is used in Encyclop[ae]dia, Medi[ae]val, [ae]gis, and in names such as [Ae]lfwin, [Ae]lfric, etc.

Perhaps as these standards mature we'll see them becoming more widely useful. Or maybe the various inaccessible glyphs will simply not be used, and will fade away like a snark or a boojum... :-(

Lee
-- 
Liam R. E. Quin,  lee@sq.com, SoftQuad Inc., Toronto, +1 (416) 963-8337
``No question is so difficult to answer as that to which the answer is obvious'' -- George Bernard Shaw
enag@ifi.uio.no (Erik Naggum) (01/31/91)
Philippe,

The oe ligature is precisely a ligature, very much unlike the ae used in Denmark, Norway, and Iceland (perhaps others). oe is not a single character any more than the ligatures fi, fl, ffi, and ffl are. oe does not influence collation order, but is spelled out. Notice that it's not an error to write "oeuvre" instead of "<oe>uvre" (where <> is used to denote a ligature). It's very much an error to write "aere" instead of "<ae>re" in Norwegian, not uncommon but still gross ASCII abuse to the contrary notwithstanding.

I've gleaned this gem from participation in numerous mailing lists, but I've forgotten precisely which one. Most probably from the ISO 10646 list on BITNET somewhere.

-- 
[Erik Naggum]	Snail: Naggum Software / BOX 1570 VIKA / 0118 OSLO / NORWAY
		Mail: <erik@naggum.uu.no>, <enag@ifi.uio.no>
My opinions.	Wail: +47-2-836-863	Another int'l standards dude.
tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill) (01/31/91)
erik@srava.sra.co.jp (Erik M. van der Poel) writes:
> In ISO 10646, it is easy to mix Japanese and Chinese in one sentence.
> Can it be done in Unicode? (I ask, because I don't know.)

Just switch fonts from Kanji to Hanzi. This is similar to what you would do if you wanted to print the title of a book--you'd switch from roman to italics. Unicode assumes that small differences between Han characters (of identical meaning) are a font issue, not a character coding issue. Different fonts are favored in Japan and in China, just as Times is popular in the US and Garamond is popular in France.

> However, Han unification may be quite useful.

Amen! It's interesting that the Japanese delegation voted against DIS 10646 because it didn't include Han unification.

Bill
ath@linkoping.telesoft.se (Anders Thulin) (01/31/91)
In article <1991Jan29.200653.23928@sq.sq.com> lee@sq.sq.com (Liam R. E. Quin) writes:
>ath@linkoping.telesoft.se (Anders Thulin) writes:
>>Considering that the OE ligature isn't used in *any* of the 8859/1-8
>>tables, I can't help wondering if it really is an important character.
>
>Well, it is used in English in imported words such as [oe]illade (an amorous
>look or glance) and [oe]uvre (the works of an artist, painter, etc.). In the
>same way, [ae] is used in Encyclop[ae]dia, Medi[ae]val, [ae]gis, and in names
>such as [Ae]lfwin, [Ae]lfric, etc.

I should have used the word 'indispensable' instead. I doubt it is so in English - all dictionaries I have consulted use the separate forms as headwords - the ligatures are occasionally listed as alternative spellings.

My problem was with French - a language I don't know. Is the <oe> ligature really indispensable? I can't help thinking it would have made its way into the Latin-1/... code tables if it was. Is `chef-d'<oe>uvre' the only way to spell that word?
-- 
Anders Thulin       ath@linkoping.telesoft.se
Telesoft Europe AB, Teknikringen 2B, S-583 30 Linkoping, Sweden
amanda@visix.com (Amanda Walker) (02/02/91)
I object to the idea of leaving out <oe> because it is "unnecessary." It is still useful, especially when representing printed texts accurately. In a similar vein, I'd like to see glyphs for "long s," ligatures such as "ct," "st," and so on, without having to resort to private encodings. -- Amanda Walker Visix Software Inc.
jeffrey@cs.chalmers.se (Alan Jeffrey) (02/03/91)
In article <723@castor.linkoping.telesoft.se> ath@linkoping.telesoft.se (Anders Thulin) writes:
>I should have used the word 'indispensable' instead. I doubt it is so
>in English - all dictionaries I have consulted use the separate forms
>as headwords - the ligatures are occasionally listed as alternative
>spellings.

The problem is that even when writing in English, you frequently need oe and ae ligatures, and even some accented letters. When? Precisely when you are talking about the English language itself. This discussion, for example, couldn't be typeset by a system that used Latin1.

If Latin1 is proposed as a standard for (for instance) encoding the textual material of published books, it's going to have to cope with people (for example historians, or linguists, or lit. critters) who need to be able to quote texts from more than 30 years ago. And as M\'\i che\'al \'O Searc\'oid pointed out at TeX90, Latin1 doesn't cover Irish, which has some accented consonants.

Oh, and my Chambers 20th Century dictionary lists the following words beginning \ae\ or \oe\ which don't have ae and oe variants:

   \ae sc (the O.E. letter `ash' now written \ae!)
   \oe il-de-b\oe uf (a little round window)
   \oe illade (an ogle)

None of these are marked as foreign or obsolete. Of course this was eight years ago; things may be different now...

>My problem was with French - a language I don't know. Is the <oe>
>ligature really indispensable - I can't help thinking it would have
>made its way into the Latin-1/... code tables if it was. Is
>`chef-d'<oe>uvre' the only way to spell that word?

Hmm, an interesting idea---`if it was useful it would be in the standard'. Ahh, if only ISO worked that way...

Cheers, Alan.
-- 
Alan Jeffrey         Tel: +46 31 72 10 98         jeffrey@cs.chalmers.se
Department of Computer Sciences, Chalmers University, Gothenburg, Sweden
ath@linkoping.telesoft.se (Anders Thulin) (02/03/91)
In article <1991Feb1.231640.3959@visix.com> amanda@visix.com (Amanda Walker) writes:
>I object to the idea of leaving out <oe> because it is "unnecessary."
>It is still useful, especially when representing printed texts
>accurately. In a similar vein, I'd like to see glyphs for "long s,"
>ligatures such as "ct," "st," and so on, without having to resort
>to private encodings.

Of course it is useful - I'm not saying it isn't. It's the degree of usefulness I'm interested in.

Your examples shed additional light on the topic: is the long s useful to sufficiently many people that it should be placed in an 8-bit code set? My reply is no. Similarly with ct, st, ffi, fi, and the rest. They can equally well be represented by the expanded versions. Only a very small class of people (textual critics) will be interested, but I believe they have other and better ways of coping with problems like these.

So, I am asking again: is the <oe> in French only one of these special ligatures that convey no extra information, or is it a separate character that *must* be included if the code table is to be of any use for French texts?
-- 
Anders Thulin       ath@linkoping.telesoft.se
Telesoft Europe AB, Teknikringen 2B, S-583 30 Linkoping, Sweden
ath@linkoping.telesoft.se (Anders Thulin) (02/03/91)
In article <4363@undis.cs.chalmers.se> jeffrey@cs.chalmers.se (Alan Jeffrey) writes:
>In article <723@castor.linkoping.telesoft.se> ath@linkoping.telesoft.se (Anders Thulin) writes:
>>[...<oe> left out of Latin-1 due to its being dispensable?...]
>If Latin1 is proposed as a standard for (for instance) encoding the
>textual material of published books, it's going to have to cope with
>people (for example historians, or linguists, or lit. critters) who
>need to be able to quote texts from more than 30 years ago.

True, but of minor relevance. 8859/1 was (as far as I understand it) intended for interchange of modern languages - obsolete and obsolescent forms were not included. Is *that* why French <oe> isn't there? (And is that why the y with dieresis is there?)

>Oh, and my Chambers 20th Century dictionary lists the following words
>beginning \ae\ or \oe\ which don't have ae and oe variants:
>
>   \ae sc (the O.E. letter `ash' now written \ae!)
>   \oe il-de-b\oe uf (a little round window)
>   \oe illade (an ogle)
>
>None of these are marked as foreign or obsolete. Of course this was
>eight years ago, things may be different now...

I'm almost sure you're joking now... I have no quarrel with <ae> - it's already in Latin-1. The two last words are obviously French (weren't they marked as such?), which brings me back to the problem I'm really interested in: is <oe> an indispensable character/glyph of French?

>>ligature really indispensable - I can't help thinking [ <oe> ] would
>>made its way into the Latin-1/... code tables if it was.
>
>Hmm, an interesting idea---`if it was useful it would be in the
>standard'. Ahh, if only ISO worked that way...

I can't help thinking that the Latin-1 code table as well as the others must have been developed in close collaboration with the national standards bodies. Since the French language appears to be rather closely monitored, I would expect very loud complaints from the French national standards organizations if characters that were of vital importance to the modern language weren't included in *any* of the Latin-n code tables, particularly the one that is claimed to cover the most important Western languages. Or don't they care? (Perhaps there's a FRASCII which makes more sense to use?)
-- 
Anders Thulin       ath@linkoping.telesoft.se
Telesoft Europe AB, Teknikringen 2B, S-583 30 Linkoping, Sweden
enag@ifi.uio.no (Erik Naggum) (02/04/91)
Amanda,

Most of these weird ligatures are covered in ISO DIS 10646. I didn't find the <ct> ligature in my last scan (got the latest draft in the mail from Mike Ksar of Hewlett-Packard only yesterday).

For those who might be scared of ISO DIS 10646: the default encoding with the one-octet compaction method is equivalent to ISO 8859-1.

-- 
[Erik Naggum]                           <enag@ifi.uio.no>
Naggum Software, Oslo, Norway           <erik@naggum.uu.no>
amanda@visix.com (Amanda Walker) (02/05/91)
In article <ENAG.91Feb4003441@holmenkollen.ifi.uio.no> enag@ifi.uio.no (Erik Naggum) writes: >Most of these weird ligatures are covered in ISO DIS 10646. Granted, and I definitely think that 10646 (or something similar, like Unicode) is the only effective approach for many applications. It certainly looks like an improvement over ISO IS 2022 :)... Don't get me wrong--I like the ISO 8859 code sets, as far as they go. What I was objecting to was the idea that "archaic" characters weren't useful to represent in electronic form. I admit that I may have overreacted; I just have strong opinions about the matter. -- Amanda Walker amanda@visix.com Visix Software Inc. ...!uunet!visix!amanda -- Courage is the willingness of a person to stand up for his beliefs in the face of great odds. Chutzpah is doing the same thing wearing a Mickey Mouse hat.
jeffrey@cs.chalmers.se (Alan Jeffrey) (02/05/91)
In article <ENAG.91Feb4020953@holmenkollen.ifi.uio.no> enag@ifi.uio.no (Erik Naggum) writes:
>Let me take issue with the bizarre idea that ISO 8859-1 should be
>sufficient for typesetting.

This wasn't what I was arguing---certainly Latin1 shouldn't cover ligatures, font changes, etc., which are too dependent on typographical decisions, but it *should* be capable of encoding plain text. Tables, mathematics, fonts, blah etc. need special encoding, but any serious character encoding scheme should be able to handle, say, a dictionary (minus the pronunciation guide).

In an idealish world, Latin1 would cover the plain text characters from the Western European languages, plus the standard typewriter/programming symbols. Unfortunately there are more than 191 of these, so something's got to give, and it's probably the claim that Latin1 covers all of Western Europe.

Unfortunately, whichever standard covers the US will probably end up becoming the de facto standard for international transmission, so we'll end up no better off than today. Sighh...

Alan.
-- 
Alan Jeffrey         Tel: +46 31 72 10 98         jeffrey@cs.chalmers.se
Department of Computer Sciences, Chalmers University, Gothenburg, Sweden
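[Archiver's note: the figure of 191 is the number of graphic positions an 8859-style 8-bit code actually has once the C0 and C1 control ranges are excluded; a two-line check:]

```python
gl = range(0x20, 0x7F)    # GL graphic area: SPACE through '~' (DEL excluded)
gr = range(0xA0, 0x100)   # GR graphic area: NBSP through 0xFF

print(len(gl) + len(gr))  # 191 graphic positions in a Latin-n table
```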
jeffrey@cs.chalmers.se (Alan Jeffrey) (02/05/91)
In article <727@castor.linkoping.telesoft.se>:

>True, but of minor relevance.  8859/1 was (as far as I understand it)
>intended for interchange of modern languages - obsolete and
>obsolescent forms were not included.

Is *that* why French <oe> isn't there?  (And is that why the y with dieresis is there?)  The question is---is Latin1 intended for modern languages, or modern documents?  Documents written today still need to be able to refer to obsolete usages.  The problem is where you draw the line to say `this character is strange enough that its usage is completely dead', and I'm not convinced that \oe\ is that old.  I even saw it on a road sign in the UK a few months ago.

[About various words beginning `\oe' in Chambers.]

>I'm almost sure you're joking now...

Well, yes, it wasn't meant particularly seriously, but it does mean that eight years ago there were still words in English that Chambers reckoned couldn't be written without \oe.  \OE illade is a bit of a weirdo, but \oe il-de-b\oe f is a common enough architectural term that I've heard of it.  And like I said, neither of these is marked foreign or obsolete.

Oh, as an aside: if I type `\oe' as `oe' in my plain text, how will I cope with the fact that it should be capitalised as `OEillade' if the oe is a ligature, and `Oeillade' otherwise?

Cheers,

Alan.
--
Alan Jeffrey			Tel: +46 31 72 10 98
jeffrey@cs.chalmers.se		Department of Computer Sciences, Chalmers University, Gothenburg, Sweden
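[Alan's capitalisation aside is easy to demonstrate in a modern Unicode-aware language, where the one-character oe-ligature case-maps as a unit.  A minimal sketch in Python -- the word `oeillade' is from the post; Python itself is of course an anachronism here:]

```python
# If oe is encoded as the single ligature character, case mapping
# promotes the whole ligature:
print("œillade".capitalize())   # Œillade

# If it is spelled out as the digraph "oe", only the first letter
# of the pair is capitalised:
print("oeillade".capitalize())  # Oeillade
```

[So a plain-text encoding that drops the ligature genuinely loses information: no purely mechanical rule can tell `Oeillade' apart from a word where o and e are separate letters.]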
erik@srava.sra.co.jp (Erik M. van der Poel) (02/05/91)
> > Amen!  It's interesting that the Japanese delegation voted against
> > DIS 10646 because it didn't include Han unification.
> >
> > Bill Tuthill
>
> I understand that Japan is also adamantly against Unicode.
>
> Mark Leisher

Mark, don't believe everything you see on the net (including what I'm writing now :-).

Bill, could you re-check the accuracy of your article, and, if it is accurate, please give us the names of the Japanese delegates involved, or the name of their working group, and any other relevant info such as document numbers and/or voting date/place.

> Does anyone have any current info on anything that is being proposed
> in lieu of ISO/IEC DIS 10646 and Unicode?

Some Japanese have set up an informal group to discuss the possible extension of Japanese EUC and Shift-JIS to support the new supplementary Kanji set called JIS X 0212.  Some of these proposals mushroomed into "internationalization" extensions.  I may be able to provide more info if anyone is interested.  (This is not to say that "the Japanese" are against both 10646 and Unicode.)

- --
Erik M. van der Poel		erik@sra.co.jp
Software Research Associates, Inc., Tokyo, Japan	TEL +81-3-3234-2692
lee@sq.sq.com (Liam R. E. Quin) (02/06/91)
amanda@visix.com (Amanda Walker) writes:

>Don't get me wrong--I like the ISO 8859 code sets, as far as they
>go.  What I was objecting to was the idea that "archaic" characters
>weren't useful to represent in electronic form.  I admit that I
>may have overreacted; I just have strong opinions about the matter.

Of course, ligatures like [oe] and [ae] are still considered `correct' in oeuvre, mediaeval, encyclopaedia, Oedipus, Aethelwine, etc., in the UK -- `archaic' is relative.  In other words, I agree with you strongly!

I think also that the distinction between glyph-name (ae-ligature), glyph and position in collation sequence must be made clear, especially as collating sequence varies from nationality to nationality.  Once we get so far advanced that we can conceive of printing a Welsh dictionary, we'd better be able to sort the entries correctly :-)  Some of the work on fonts from ISO 9541 might be profitable reading here.

And yes, I'd love a standard position for tall-s, yogh, etc., but the ct ligature should be inserted automatically, in the same way that the ff ligature is made at the moment in electronic systems.

Lee
--
Liam R. E. Quin, lee@sq.com, SoftQuad Inc., Toronto, +1 (416) 963-8337
yfcw14@castle.ed.ac.uk (K P Donnelly) (02/07/91)
em@dce.ie (Eamonn McManus) writes:

>No, modern Irish does not have any accented consonants.  It does require
>the ability to put acute accents on all five vowels, though.  Older Irish
>writing used a dot above a consonant to indicate lenition, which is now
>written as a h after the letter.  But this writing uses a special script
>which is not in Latin1 anyhow.

Agreed, except for the last sentence.  It was actually just a special font, like the Fraktur formerly used with German.  The character set standards don't cover fonts.

Kevin Donnelly
jeffrey@cs.chalmers.se (Alan Jeffrey) (02/07/91)
In article <1991Feb5.174923.16236@sq.sq.com> lee@sq.sq.com (Liam R. E. Quin) writes:

>I think also that the distinction between glyph-name (ae-ligature), glyph
>and position in collation sequence must be made clear, especially as
>collating sequence varies from nationality to nationality.  Once we get so
>far advanced that we can conceive of printing a Welsh dictionary, we'd better
>be able to sort the entries correctly :-)

Agreed totally---one of the best tests for `is this glyph a separate letter or just a decorated form of another letter?' is whether it alphabetizes and/or capitalizes differently.  So French \'a can be regarded as an a with an accent, but Swedish \"a can't, as it alphabetizes to the back of the dictionary.  On this basis I'd claim `\oe' as a separate letter, as it capitalizes to `\OE' whereas `oe' capitalizes to `Oe'.

In general, collation is much more difficult than it appears---not only does it vary from language to language (is \"o before or after p?) but also from application to application (is `McCarthy' before or after `May'?).  And you should try convincing BibTeX that some people's surnames come before their given names...

Cheers,

Alan.
--
Alan Jeffrey			Tel: +46 31 72 10 98
jeffrey@cs.chalmers.se		Department of Computer Sciences, Chalmers University, Gothenburg, Sweden
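[Both kinds of variation Alan mentions -- per-language and per-application -- can be modelled as sort keys.  A minimal sketch in modern Python; the accent-stripping key is a crude simplification of real French collation, and the Mc-as-Mac rule is merely the informal `phone-book' convention, not any standard:]

```python
import unicodedata

def french_key(word):
    # French dictionaries sort accented letters with their base letter,
    # so decompose (NFD) and drop the combining accent marks.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def phonebook_key(name):
    # `Phone-book' convention: treat the prefix Mc as Mac.
    expanded = "Mac" + name[2:] if name.startswith("Mc") else name
    return expanded.lower()

# Naive codepoint order puts 'été' after 'fin', since é > z numerically:
print(sorted(["eux", "été", "fin"]))                  # ['eux', 'fin', 'été']
print(sorted(["eux", "été", "fin"], key=french_key))  # ['été', 'eux', 'fin']
print(sorted(["May", "McCarthy"], key=phonebook_key)) # ['McCarthy', 'May']
```

[The point survives translation to any language: a single hard-wired ordering of character codes cannot serve two collation conventions at once, so the codes and the collation rules have to be specified separately.]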
lee@sq.sq.com (Liam R. E. Quin) (02/12/91)
lee@sq.sq.com (Liam R. E. Quin) writes:

> I think also that the distinction between glyph-name (ae-ligature), glyph
> and position in collation sequence must be made clear, especially as
> collating sequence varies from nationality to nationality.  [...]

jeffrey@cs.chalmers.se (Alan Jeffrey) writes:

>Agreed totally---one of the best tests for `is this glyph a separate
>letter or just a decorated form of another letter?' is whether it
>alphabetizes and/or capitalizes differently.

Another important point is that if one were (for example) to use "oe" instead of the oe-ligature in French, one could no longer set those words which contain o and e as distinct glyphs, such as `coexistence'.  Perhaps typesetting systems could have a ligature-exception table, which would prevent such errors -- it's not clear to me.  I don't know of any French words which change in meaning if the oe-ligature is replaced by "oe", but then I don't know French.

Examples of Dutch IJ capitalisation (which are correct in all of the atlases I own) have been provided recently on the net, of course.  Oedipus and Aelfwine don't look at all right to me...!

Lee
--
Liam R. E. Quin, lee@sq.com, SoftQuad Inc., Toronto, +1 (416) 963-8337
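[Liam's suggested ligature-exception table might look something like the following.  A minimal sketch in modern Python -- the exception entries and the blanket oe-to-ligature rule are both purely illustrative; a real typesetter would do this at the font-substitution level:]

```python
# Hypothetical exception table: words in which o and e really are
# distinct letters and must NOT be joined into a ligature.
LIGATURE_EXCEPTIONS = {"coexistence", "coexist", "poem", "poet"}

def set_word(word):
    """Replace the oe digraph with the ligature unless the word is excepted."""
    if word.lower() in LIGATURE_EXCEPTIONS:
        return word
    return (word.replace("OE", "Œ")
                .replace("Oe", "Œ")
                .replace("oe", "œ"))

print(set_word("oeuvre"))       # œuvre
print(set_word("coexistence"))  # coexistence (unchanged)
```

[This mirrors how ff-style ligatures are already handled automatically, with the table catching the cases -- like `coexistence' -- where the mechanical rule would be wrong.]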