[comp.text] International

npn@cbnewsl.att.com (nils-peter.nelson) (12/27/90)

Thanks to all those who responded to my request for international
troff requirements.  The responses were uniformly helpful and
specific (if occasionally impatient with the backwardness
evident in this little part of the world).  In addition to public
responses, I got private letters from: Anders Thulin, Dan Berry,
Dick Dunn, Heimir Thor Sverrisson, Chris Lewis, Alexios Zavras,
Jaap, Robert Andersson, Steve Azmier, and a very appropriate paper
from Keizer, Simonsen and Akkerhuis [KSA].

The consensus appears to be:
1. Allow all DWB components to read 8-bit characters as defined
by ISO 8859-1, a.k.a. Latin-1.  The editing and preparation of
such documents is the province of 8-bit terminals, 8-bit editors,
and not our concern.  This requires that we remove all &177's.
2. Default behavior for troff should be "8-bit in, 8-bit out".
The postprocessors will be rewritten to take this into account.
In addition, we should allow a "-7b" option to force troff
output to be in the ASCII (ISO 646, 7 bit) subset.  This would permit
mailing of ditroff output to the part of North America that
hasn't caught on to ISO 8859.
3. Recognize two-character 7-bit escapes so that people who
don't have 8-bit terminals can still create documents with
the extra characters.  [KSA] have proposed a reasonable standard
convention (e.g., \(oa for 'aring') which could serve as both
input and output for troff, but there are other proposals we
will look at as well.
4. Reserve \C'string' and \N'number' for the truly odd characters
that don't have a more convenient representation.
5. Hyphenation may present insurmountable problems; we'll see
if anyone else (e.g. Knuth) has solved them.  Worst case,
however, is that we'll hyphenate badly, and you'll have to
turn it off.
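
As a rough illustration of items 1 to 3 (not DWB code), the Python
sketch below shows what an &177 mask does to a Latin-1 character and
how two-character escapes keep a document 7-bit clean.  The function
names are hypothetical, and the table holds only the two escape names
that appear in this thread; a real table would be far larger.

```python
# Only the two escape names mentioned in this thread are listed here.
ESCAPES = {
    "\u00e5": r"\(oa",  # a with ring above
    "\u00e2": r"\(^a",  # a with circumflex
}

def strip_high_bit(data: bytes) -> bytes:
    """Mimic the effect of masking every byte with &177 (octal)."""
    return bytes(b & 0o177 for b in data)

def to_seven_bit(text: str) -> str:
    """Rewrite non-ASCII characters as two-character troff escapes."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in ESCAPES:
            out.append(ESCAPES[ch])
        else:
            # No guess is made for characters missing from the table.
            raise ValueError(f"no escape known for {ch!r}")
    return "".join(out)

latin1 = "p\u00e5".encode("latin-1")   # 'p' followed by a-ring (0xE5)
print(strip_high_bit(latin1))          # b'pe': the a-ring became 'e'
print(to_seven_bit("p\u00e5"))         # p\(oa
```

Masking silently turns one letter into another, whereas the escape
form survives a 7-bit channel and can be mapped back.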

We will probably package this as DWB 3.2, which will be an
"incremental" upgrade to DWB 3.1 (this means a minor fee for
those who are DWB 3.1 licensees). Some of the work has already
been completed, so the package should be ready around May 1991.

jjc@jclark.UUCP (James Clark) (12/31/90)

In article <1990Dec27.155046.14520@cbnewsl.att.com> npn@cbnewsl.att.com (nils-peter.nelson) writes:

   The consensus appears to be:
   1. Allow all DWB components to read 8-bit characters as defined
   by ISO 8859-1, a.k.a. Latin-1.  The editing and preparation of
   such documents is the province of 8-bit terminals, 8-bit editors,
   and not our concern.  This requires that we remove all &177's.

groff already does this.

   2. Default behavior for troff should be "8-bit in, 8-bit out".
   The postprocessors will be rewritten to take this into account.

groff already does this.

   In addition, we should allow a "-7b" option to force troff
   output to be in the ASCII (ISO 646, 7 bit) subset.  This would permit
   mailing of ditroff output to the part of North America that
   hasn't caught on to ISO 8859.

I'm unconvinced by this.  What's wrong with using uuencode?  In any
case, if you want to send a document to somebody, it would seem to me
to be better to send either the ditroff input file or the
postprocessor output (since the ditroff output is tailored to a
particular device anyway).

   3. Recognize two-character 7-bit escapes so that people who
   don't have 8-bit terminals can still create documents with
   the extra characters.  [KSA] have proposed a reasonable standard
   convention (e.g., \(oa for 'aring') which could serve as both
   input and output for troff, but there are other proposals we
   will look at as well.

groff already does this.  It uses the two-character names described in
[KSA].  It would be a pity if DWB adopted an incompatible scheme.

   4. Reserve \C'string' and \N'number' for the truly odd characters
   that don't have a more convenient representation.

groff takes this approach.

   5. Hyphenation may present insurmountable problems; we'll see
   if anyone else (e.g. Knuth) has solved them.  Worst case,
   however, is that we'll hyphenate badly, and you'll have to
   turn it off.

I believe groff has a good solution to the hyphenation problem.
Hyphenation works in terms of hyphenation codes. Initially, the
letters `a' to `z' have `a' to `z' as their hyphenation codes, and `A'
to `Z' have `a' to `z'. There's a request that allows you to specify
the hyphenation code for any normal or special character; for example,

  .hcode \(^a a

would give `\(^a' (the name for `a' with a circumflex accent) a
hyphenation code of `a'.  Groff uses the same hyphenation algorithm
that TeX does (invented by Frank Liang): the hyphenation process is
controlled by a set of hyphenation patterns; letters in the patterns
are interpreted as hyphenation codes.  By supplying an appropriate
file of patterns and set of `hcode' requests, it should be possible to
make groff correctly hyphenate languages other than English.
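
The interplay of hyphenation codes and patterns can be sketched in a
few lines of Python.  This is a simplified illustration of Liang-style
pattern matching, not groff's implementation: the function names are
hypothetical, the single pattern `a1t' is made up rather than taken
from any real pattern file, minimum fragment lengths are ignored, and
characters without a hyphenation code are simply passed through as
themselves.

```python
import re

def parse_pattern(pat):
    """Split a pattern such as 'a1t' into letters 'at' and weights [0, 1, 0]."""
    letters = re.sub(r"\d", "", pat)
    weights = [0] * (len(letters) + 1)
    pos = 0
    for ch in pat:
        if ch.isdigit():
            weights[pos] = int(ch)   # weight of the gap before letters[pos]
        else:
            pos += 1
    return letters, weights

def hyphenate(word, patterns, hcode):
    """Insert '-' at every gap whose highest pattern weight is odd."""
    # Patterns are matched against hyphenation codes, not the raw characters.
    codes = "." + "".join(hcode.get(ch, ch) for ch in word) + "."
    levels = [0] * (len(codes) + 1)  # levels[i] is the gap before codes[i]
    for letters, weights in patterns:
        start = codes.find(letters)
        while start != -1:
            for k, w in enumerate(weights):
                levels[start + k] = max(levels[start + k], w)
            start = codes.find(letters, start + 1)
    pieces, last = [], 0
    for j in range(1, len(word)):
        if levels[j + 1] % 2 == 1:   # gap between word[j-1] and word[j]
            pieces.append(word[last:j])
            last = j
    pieces.append(word[last:])
    return "-".join(pieces)

patterns = [parse_pattern("a1t")]
print(hyphenate("p\u00e2te", patterns, {"\u00e2": "a"}))  # pâ-te
print(hyphenate("p\u00e2te", patterns, {}))               # pâte
```

With the hyphenation code in place, the accented letter matches the
`a' in the pattern; without it, the pattern never applies.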

   We will probably package this as DWB 3.2, which will be an
   "incremental" upgrade to DWB 3.1 (this means a minor fee for
   those who are DWB 3.1 licensees). Some of the work has already
   been completed, so the package should be ready around May 1991.

These features are in the currently released version of groff (0.6).

James Clark
jjc@jclark.uucp
jjc@ai.mit.edu

bruce@balilly.UUCP (Bruce Lilly) (01/02/91)

In article <JJC.90Dec31140544@jclark.jclark.UUCP> jjc@jclark.UUCP (James Clark) writes:
>In article <1990Dec27.155046.14520@cbnewsl.att.com> npn@cbnewsl.att.com (nils-peter.nelson) writes:
>groff already does this.
>
>   In addition, we should allow a "-7b" option to force troff
>   output to be in the ASCII (ISO 646, 7 bit) subset.  This would permit
>   mailing of ditroff output to the part of North America that
>   hasn't caught on to ISO 8859.
>
>I'm unconvinced by this.  What's wrong with using uuencode?  In any
>case, if you want to send a document to somebody, it would seem to me
>to be better to send either the ditroff input file or the
>postprocessor output (since the ditroff output is tailored to a
>particular device anyway).

Uuencode/uudecode are not universally available.

If the ditroff input file is in an 8-bit character set, it is unmailable
via some mail transport software.  Likewise for 8-bit output, hence the
desire to restrict the output to a 7-bit character set.

The 'd' and 'i' in ditroff stand for "device" and "independent",
respectively.  Ditroff output is *not* tailored to any particular device.
The ditroff output can be interpreted by postprocessors for specific
devices.  The same cannot necessarily be said of other text processors
(perhaps including groff).

Postprocessor output *is* tailored to a specific device, hence is not
suitable for widespread distribution.  Also, note that some of these
device-specific output formats (such as PostScript) are both extremely
verbose (more so than ditroff output) and may include 8-bit characters.
--
	Bruce Lilly		blilly!balilly!bruce@sonyd1.Broadcast.Sony.COM

jaap@mtxinu.COM (Jaap Akkerhuis) (01/03/91)

In article <JJC.90Dec31140544@jclark.jclark.UUCP> jjc@jclark.UUCP (James Clark) writes:
 > 
 > would give `\(^a' (the name for `a' with a circumflex accent) a
 > hyphenation code of `a'.  Groff uses the same hyphenation algorithm
 > that TeX does (invented by Frank Liang): the hyphenation process is
 > controlled by a set of hyphenation patterns; letters in the patterns
 > are interpreted as hyphenation codes.  By supplying an appropriate
 > file of patterns and set of `hcode' requests, it should be possible to
 > make groff correctly hyphenate languages other than English.

Not necessarily. This depends on the rules of the language.  The
hyphenation rules might treat the ``a^'' completely differently than
an ``a'' for a given language. In that case, mapping is not good
enough.


	jaap

jay@silence.princeton.nj.us (Jay Plett) (01/03/91)

In article <1991Jan2.024946.10442@blilly.UUCP>, bruce@balilly.UUCP (Bruce Lilly) writes:
 ...
> The 'd' and 'i' in ditroff stand for "device" and "independent",
> respectively.  Ditroff output is *not* tailored to any particular device.
> The ditroff output can be interpreted by postprocessors for specific
> devices. ...
That's misleading.  Ditroff output is not only device dependent, it
is dependent on a particular set of width tables for a particular
device.  Ditroff and the postprocessor MUST use the same set of
width tables.  Ditroff outputs motions that are derived from the
width tables.  Moreover, when a character does not exist in the
current font, no font change is encoded in ditroff's output.
Ditroff assumes that the postprocessor will not only have the same
widths, but that it will also use the same strategy for noticing
that a font change is necessary and for finding the same character
in the same font that ditroff found it in.

The "di" in ditroff means that device dependence is bound at
run-time rather than at compile-time.

> Postprocessor output *is* tailored to a specific device, hence is not
> suitable for widespread distribution.
If the same device will be used, there's no harm in distributing
postprocessor output.  It would be rash to distribute ditroff output
and expect it to print correctly, quite possibly even among similar
systems at the same site.

	...jay

peterson@lyle.austin.ibm.com (James L. Peterson/1000000) (01/03/91)

In article <679@silence.princeton.nj.us> jay@silence.princeton.nj.us (Jay Plett) writes:
>That's misleading.  Ditroff output is not only device dependent, it
>is dependent on a particular set of width tables for a particular
>device.  Ditroff and the postprocessor MUST use the same set of
>width tables.  Ditroff outputs motions that are derived from the
>width tables.  Moreover, when a character does not exist in the
>current font, no font change is encoded in ditroff's output.
>Ditroff assumes that the postprocessor will not only have the same
>widths, but that it will also use the same strategy for noticing
>that a font change is necessary and for finding the same character
>in the same font that ditroff found it in.

That's not really true.  Ditroff dvi codes include the output motions
that troff expects as a result of output.  This means that the post
processor need not have any width tables at all, IF the output device
does not automatically move after each output character.  Thus,
troff dvi will include "46z37o29t" which means to move 46 units,
output a 'z', move 37 units, output an 'o', move 29 units, output
a 't', ...  ALL movement is explicit in the dvi file.
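
A minimal Python sketch of reading such a fragment follows.  It
handles only the "motion, then character" form described above; real
ditroff output contains further commands (fonts, sizes, vertical
motion) that are ignored here, and the function name is hypothetical.

```python
import re

def absolute_positions(stream: str) -> list[tuple[int, str]]:
    """Turn '46z37o29t' into (horizontal position, character) pairs."""
    pos = 0
    placed = []
    # Each item is a run of digits (a relative motion) followed by one character.
    for motion, char in re.findall(r"(\d+)(\D)", stream):
        pos += int(motion)
        placed.append((pos, char))
    return placed

print(absolute_positions("46z37o29t"))
# [(46, 'z'), (83, 'o'), (112, 't')]
```

No width table is consulted at any point: the positions come entirely
from the numbers in the stream.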

Contrast this with TeX dvi which assumes that the postprocessor
"knows" the same character widths as TeX did.  TeX dvi would include
only "zot" and would expect that the postprocessor would know the
width of the characters and would move the "right" amount.  This means
that TeX dvi is smaller (doesn't have to include all the movements
that are almost always the same for all the characters), but also means
you have no idea where to put characters without the width tables.
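
For contrast, here is a sketch of placement driven by a width table,
following the characterization above (a real TeX dvi file contains
more than bare characters).  Both width tables and the function name
are invented for illustration.

```python
def positions_from_widths(text, widths, start=0):
    """Place each character where the previous one ended, per the width table."""
    pos = start
    placed = []
    for ch in text:
        placed.append((pos, ch))
        pos += widths[ch]   # the reader must know this width to go on
    return placed

table_a = {"z": 37, "o": 29, "t": 20}   # hypothetical widths
table_b = {"z": 40, "o": 30, "t": 22}   # a different, equally hypothetical font

print(positions_from_widths("zot", table_a))  # [(0, 'z'), (37, 'o'), (66, 't')]
print(positions_from_widths("zot", table_b))  # [(0, 'z'), (40, 'o'), (70, 't')]
```

The same three characters land in different places as soon as the
reader's widths differ from the writer's.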

troff dvi, on the other hand, can easily be interpreted without the
character width tables, or with other width tables.  The position of
each and every character is completely determined by the dvi file.
Now if you don't have the right fonts (like you are printing on a
different output device than the file was produced for), the characters
will look strange, but each and every one will be in exactly the
right place.

I have used this in two ways: (1) it is easy to write a screen display
for troff dvi that will preview a document at different resolution and
with different fonts (screen fonts rather than printer fonts).  The
output is not as readable, but it shows the correct placement of the
characters, so you can see margins and tables, and if expanded enough
it is still quite readable.  (2) I can take documents that are meant for
a typesetter at one site, ship them across country and print them on
a laser printer of different resolution with different fonts with ease.
The post processor has to be aware of the difference in font tables
and be willing to substitute or ask for font substitutes (with
troff's R, I, B fonts it is fairly easy to substitute, and even the
special two-character names are portable).  It must also distinguish
between the input resolution (of the dvi file) and the output resolution
(of the output device), which may be different, and be prepared to
scale accordingly.
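
The scaling step might look like the Python sketch below.  The
resolutions 720 and 300 units per inch are example values only, and
the function name is hypothetical.

```python
def scale_positions(positions, in_res, out_res):
    """Convert absolute positions from in_res to out_res units per inch."""
    # Integer arithmetic with round-to-nearest; positions are assumed non-negative.
    return [(p * out_res + in_res // 2) // in_res for p in positions]

print(scale_positions([46, 83, 112], in_res=720, out_res=300))
# [19, 35, 47]
```

Scaling the absolute positions, rather than each relative motion in
turn, keeps rounding errors from accumulating along the line.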

Neither of these is possible/easy with TeX dvi, unless you have the
exact same character font tables.

>If the same device will be used, there's no harm in distributing
>postprocessor output.  It would be rash to distribute ditroff output
>and expect it to print correctly, quite possibly even among similar
>systems at the same site.
>
A lot of this statement depends on what "print correctly" means.  If
you mean print and look like it was formatted for the device that it
was printed on, this is a reasonable statement, but if you mean
"print correctly" to mean "put each character exactly where troff
wanted it to be" (even though the characters are a different font,
and probably different widths), troff dvi is great for this.

James L. Peterson  (peterson@futserv.austin.ibm.com)
-- 
James L. Peterson
   IBM Advanced Workstations Div. !'s: cs.utexas.edu!ibmchs!peterson
   11400 Burnet Road, MS 2812     @'s: @CS.UTEXAS.EDU:peterson@ibmchs.uucp
   Austin, Texas 78758            !&@: ibmchs!peterson@CS.UTEXAS.EDU

jjc@jclark.UUCP (James Clark) (01/04/91)

In article <1991Jan2.231520.21468@mtxinu.COM> jaap@mtxinu.COM (Jaap Akkerhuis) writes:

   In article <JJC.90Dec31140544@jclark.jclark.UUCP> jjc@jclark.UUCP (James Clark) writes:
    > 
    > would give `\(^a' (the name for `a' with a circumflex accent) a
    > hyphenation code of `a'.  Groff uses the same hyphenation algorithm
    > that TeX does (invented by Frank Liang): the hyphenation process is
    > controlled by a set of hyphenation patterns; letters in the patterns
    > are interpreted as hyphenation codes.  By supplying an appropriate
    > file of patterns and set of `hcode' requests, it should be possible to
    > make groff correctly hyphenate languages other than English.

   Not necessarily. This depends on the rules of the language.  The
   hyphenation rules might treat the ``a^'' completely differently than
   an ``a'' for a given language. In that case, mapping is not good
   enough.

There is nothing in the groff scheme that constrains `a^' to have the
same hyphenation code as `a'.  A hyphenation code can be any single
input character that isn't a digit or white space.  For example, you
could make the hyphenation code of `a^' (and `A^') the character which
is `a^' in ISO 8859-1.  You just have to make sure that the `hcode'
requests match the conventions that were used in the generation of the
hyphenation patterns.
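
The table of hyphenation codes can be modelled roughly as below.  This
Python sketch only mirrors the rules stated in this thread (defaults
for `a' to `z' and `A' to `Z', and a code being any single character
that is not a digit or white space); the function names are
hypothetical and this is not groff's data structure.

```python
import string

def default_hcodes():
    """Initial table: a-z map to themselves, A-Z map to a-z."""
    table = {c: c for c in string.ascii_lowercase}
    table.update({c: c.lower() for c in string.ascii_uppercase})
    return table

def set_hcode(table, char, code):
    """Assign a hyphenation code, enforcing the stated restrictions."""
    if len(code) != 1 or code.isdigit() or code.isspace():
        raise ValueError("code must be one character, not a digit or white space")
    table[char] = code

table = default_hcodes()
# Give a-circumflex its own code instead of folding it onto plain 'a'.
set_hcode(table, "\u00e2", "\u00e2")
set_hcode(table, "\u00c2", "\u00e2")
print(table["A"], table["\u00c2"])  # a â
```

Patterns written for such a table would then contain the accented
letter itself wherever its behaviour differs from that of plain `a'.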

James Clark
jjc@jclark.uucp
jjc@ai.mit.edu