[comp.text] troff postprocessors for ISO 8859 characters

npn@cbnewsl.att.com (nils-peter.nelson) (12/29/90)

The original DWB's (1.0 and 2.0) supported a variety of
printers: default was C/A/T, others were daisy, Imagen,
Xerox, etc.  In 3.1 we feature PostScript but also provide
Imagen and HP LaserJet support.
Our plan would be to provide support for the European standard
(ISO 8859-1) character set only with the PostScript postprocessor,
dpost.  The reason is that the HP LaserJet requires host-resident
fonts, in bitmap form, for every point size. The size of the
font support for the LaserJet is already several megabytes,
and the additional characters would add considerably. I'm not
sure where we'd get the bitmaps from, either.
PostScript already provides most if not all of the additional
characters-- they are already in the printer, we only have to
generate the name or position of the character in dpost.
So, my question is, have the Europeans settled on PostScript as
a standard for printers, or is there something else we should
be supporting?
(Special note to HP LaserJet owners: as you may already know,
for US $700 you can add a PostScript cartridge to your LaserJet II.
In addition to the added flexibility of PostScript you will
probably recover your investment with the disk space you save
when you rm the LaserJet bitmaps!)

clewis@ecicrl.UUCP (Chris Lewis) (12/30/90)

In article <1990Dec28.195703.2749@cbnewsl.att.com> npn@cbnewsl.att.com (nils-peter.nelson) writes:

[Re: DWB 3.2 support for Latin-1]

>Our plan would be to provide support for the European standard
>(ISO 8859-1) character set only with the PostScript postprocessor,
>dpost.  The reason is that the HP LaserJet requires host-resident
>fonts, in bitmap form, for every point size. The size of the
>font support for the LaserJet is already several megabytes,
>and the additional characters would add considerably. I'm not
>sure where we'd get the bitmaps from, either.

>PostScript already provides most if not all of the additional
>characters-- they are already in the printer, we only have to
>generate the name or position of the character in dpost.
>So, my question is, have the Europeans settled on PostScript as
>a standard for printers, or is there something else we should
>be supporting?
>(Special note to HP LaserJet owners: as you may already know,
>for US $700 you can add a PostScript cartridge to your LaserJet II.
>In addition to the added flexibility of PostScript you will
>probably recover your investment with the disk space you save
>when you rm the LaserJet bitmaps!)

I get my revenge...  Oh so sweet.

Psroff has solved most of these difficulties, and I would have made
them available as source for AT&T to use, but somehow "Chris Lewis
doesn't feel that way".  Nyah, nyah ;-)

[Ronald Khoo was right, you should be careful about egregious misquoting
of guys like me.  What goes around comes around.]

But, I'm a nice guy, so I'll tell you how to solve your problems anyways:

    Font compression of HP SFP's (native HP PCL fonts):
	- compress (the PD one which I believe is now almost POSIX
	  required aside from the copyright issue which is still with
	  us.)
	- TeX PK format (compress won't compress them)
	These are the sizes of a Helvetica font at 10 point in three different
	formats (H.10.sfp doesn't have the full Latin-1 set, but it should
	be reasonably close):
	    -rw-r--r--  1 clewis  users      3988 Jul 28 23:42 H.10.pk
	    -rw-r--r--  1 clewis  users     10241 Dec 29 12:57 H.10.sfp
	    -rw-r--r--  1 clewis  users      5149 Dec 29 12:57 H.10.sfp.Z
	Normally, psroff is told whether to look for a ".pk" or ".sfp"
	font file for a given font at a specific size.  However, psroff's
	font reader doesn't care whether the file it finds is PK or SFP
	because you can tell from the first byte whether it's an SFP or
	PK, and the reader automatically switches to the right decoding
	software.  If psroff can't find a file with the .pk or .sfp suffix,
	it automatically checks for a tacked on ".Z", and will popen a
	zcat (compress -dc) if it finds one to read and decode the font file.

	Psroff actually maintains the font internally as more of a PK
	format, but will read the SFP as a variant of the "unpacked PK"
	format.  Of course, the emission of the font is in SFP format.
	(Does DWB 3.1's LJ emitter support incremental downloading?
	Psroff and jetroff do.)

	The compression variant is really easy for you to incorporate
	into DWB 3.2.  You could ship the fonts entirely compressed, and
	then tell the customer to uncompress those fonts that are used a
	lot to eliminate the performance hit of decompression most of
	the time, but still have the full set available for immediate
	use.  Psroff users haven't complained to me about the
	performance of this.  (They did about other stuff, but I've
	fixed that).

	The PK to SFP conversion isn't that easy (unless you steal
	psroff source).

	Standalone programs to convert PK's to SFP's (including
	changing the mappings) are included with psroff (SoftQuad is
	using a version of this software with my blessing to create
	some of the fonts they distribute).  Jetroff includes a program
	to convert SFP's to PK's which I use to create the PK format fonts I
	distribute with psroff.
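
The lookup described above (try the plain file, fall back to a ".Z"
copy, then pick the decoder from the first byte) can be sketched
roughly as follows.  This is an illustration only, not psroff source:
the function names and the "name.size.suffix" layout are taken from
the listing above or made up for the example, and it assumes the usual
magic numbers, namely a PK preamble byte of 247 and an ESC (033)
opening an HP soft font.

```python
import os
import subprocess

PK_PRE = 247   # first byte of a TeX PK file (the "pre" opcode)
ESC = 0x1B     # an HP soft font begins with an escape sequence


def sniff_format(first_byte):
    """Guess the font format from the first byte of the file."""
    if first_byte == PK_PRE:
        return "pk"
    if first_byte == ESC:
        return "sfp"
    return None


def find_font(fontdir, name, size, suffix):
    """Return (path, is_compressed) for e.g. H.10.sfp or H.10.sfp.Z."""
    base = os.path.join(fontdir, "%s.%d.%s" % (name, size, suffix))
    if os.path.exists(base):
        return base, False
    if os.path.exists(base + ".Z"):
        return base + ".Z", True
    return None


def read_font(path, is_compressed):
    """Return (format, raw bytes); .Z files are piped through zcat."""
    if is_compressed:
        # Assumes a zcat that understands compress(1) output is on PATH.
        data = subprocess.run(["zcat", path], check=True,
                              stdout=subprocess.PIPE).stdout
    else:
        with open(path, "rb") as f:
            data = f.read()
    fmt = sniff_format(data[0]) if data else None
    return fmt, data
```

Because the format is decided from the data rather than the suffix, a
PK file that happens to be named ".sfp" would still be decoded
correctly, which is the behaviour described above.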

    Font/code sources:
	1 HP has sets of at least Roman and Helvetica at the sizes you'll
	  need.  (They seem to be discontinuing the floppy version however.
	  I know that they have Latin-1 symbol sets, but I don't know whether
	  they're currently available on floppy).  Maybe you can do a deal
	  with them.  These are VERY good-looking fonts - to my eye they look
	  nicer than the LaserWriter's Postscript fonts.
	2 TeX PK's are available that have most of the characters you
	  need (eg: the University of Toronto distribution).  Psroff has
	  facilities to search for and merge/remap these files into SFP's.
	  ("buildfonts")
	3 The freeware/shareware version of jetroff had PK's that buildfonts
	  works with.
	4 The commercial version of Jetroff has similar PK's and might
	  have the Latin-1 extensions too.
	5 METAFONT.

"cm" PK fonts are rather ugly, but there are other fonts available that
look nicer (eg: the am or jetroff's jm).

The fonts that come with psroff for laserjets are built out of 2 and 3
(indirectly 5 of course), and I know of people using psroff with 1 and 4.
And a few people have parts of the Latin-1 set working thru psroff.
(I'm working on full support for them - thanks for eliciting the
paper on the subject from the net).

One thing you should be very aware of is that the HP Laserjet III has
font scaling built in, and you can get CG Times and Univers at any
size you want out of them, just by requesting them by characteristic.
You should support this.  Psroff does, but I don't have the width table
issue sorted out quite yet.
-- 
Chris Lewis, Phone: (613) 832-0541
UUCP: uunet!utai!lsuc!ecicrl!clewis
Moderator of the Ferret Mailing List (ferret-request@eci386)
Psroff mailing list (psroff-request@eci386)

jjc@jclark.UUCP (James Clark) (01/01/91)

Groff has a composite character feature which helps with using ISO
8859/1 with a device that doesn't have all the ISO 8859/1 characters.
For example, suppose you want to be able to input an `a' circumflex
using ISO 8859/1 (`a' circumflex has code 0342), and suppose you have
a device which has the letter `a' and has a circumflex accent, but
doesn't have an `a' circumflex as a single character.  Assuming `\*^'
has been defined appropriately, you just have to do:

.char \342 a\\*^

(by \342 I mean the character whose code is 0342.)  After this you can
use \342 exactly as if your output device provided an `a' with a
circumflex as a single character.

The `char' request is useful for other things too: for example,

.char \(ru \D'l .5m 0'

will get you a \(ru character if your output device happens not to
have one.

Characters defined with the `char' request can be used just like other
characters: for example, they will be hyphenated properly (after an
appropriate `hcode' request); they can also be used with the `lc'
request, with `tr' request and with the `\l' or `\L' escape sequences.

James Clark
jjc@jclark.uucp

clewis@ecicrl.UUCP (Chris Lewis) (01/03/91)

In article <JJC.90Dec31183059@jclark.jclark.UUCP> jjc@jclark.UUCP (James Clark) writes:
>Groff has a composite character feature which helps with using ISO
>8859/1 with a device that doesn't have all the ISO 8859/1 characters.
>For example, suppose you want to be able to input an `a' circumflex
>using ISO 8859/1 (`a' circumflex has code 0342), and suppose you have
>a device which has the letter `a' and has a circumflex accent, but
>doesn't have an `a' circumflex as a single character.  Assuming `\*^'
>has been defined appropriately, you just have to do:

So does psroff.  In fact, I'd like to include the following plea to Nils-Peter
for consideration in DWB 3.2:

It is not necessary to limit the emit sequence in the ditroff width table
files to be just one character.  It would be advantageous to permit the
fourth field in the width table to consist of any arbitrary sequence
of characters, including backslash escape sequences (octal, maybe even
hex ala 1003.1 string definitions).  In this way, people can produce
composite characters for mundane things (such as O overstrike c) that
a printer doesn't support, as well as Latin-1 extensions on printers
that are short the special glyphs, and even get at characters you call
for by name (the characters not in the default postscript encoding
vectors).

Psroff has this feature: embedded in the built-in emit sequences (after all,
CAT troff can't extend its basic character set) plus translation override
facilities, with multiple char codes, and indeed, invocations of Postscript
drawing routines or glyph-backspace-glyph sequences etc.

>.char \342 a\\*^

This feature is going to be appearing in Psroff soon.

As another plea, in the extremely unlikely event that somebody does get
to diddle SVR4 CAT Troff and sees this posting, PLEASE PLEASE PLEASE implement
"\!".  This directive should simply pump its arguments out *in* the
CAT code stream, prefixing it with an unused CAT code (eg: 'M') and
terminated with a null or newline.  That would make my life complete.

[I enjoy the simple life ;-)]
-- 
Chris Lewis, Phone: (613) 832-0541
UUCP: uunet!utai!lsuc!ecicrl!clewis
Moderator of the Ferret Mailing List (ferret-request@eci386)
Psroff mailing list (psroff-request@eci386)

npn@cbnewsl.att.com (nils-peter.nelson) (01/03/91)

Chris Lewis requests an additional field in the width
tables to instruct troff how to manufacture the additional
8859 characters that are not in ASCII. Some of them appear
quite easy (e.g., the Yen sign looks like a Y with a line
through it, or the lower case letters with accent marks)
but others appear near-impossible. For example, all the
upper case letters will obliterate diacriticals, and the
Icelandic eth doesn't appear to have an obvious representation.
Is it worth doing half the job? (I.e., should we try to
implement those characters that can be done this way and
forget the others? Do a bad job on the others?)

My inclination is to support two and only two modes for
"production": PostScript and nroff. If you want ISO 8859
nroff, get an ISO 8859 terminal. The stuff about 7 bit
shorthand for 8 bit characters was intended for debugging
and interchange, not production. So far, no one has answered
my previous question: will this direction meet the needs
of the European market?
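
One rough way to size up the "half the job" question is to count how
many of the upper-half ISO 8859-1 codes split into a plain letter plus
a single accent.  The sketch below uses Unicode's canonical
decomposition data as a stand-in for a hand-made table; that is an
assumption of this example, not something any troff postprocessor in
this thread does, and the function names are invented.

```python
import unicodedata


def composition(code):
    """Return (base letter, accent name) if the Latin-1 code splits into
    an ASCII base plus one combining mark, else None."""
    ch = bytes([code]).decode("latin-1")
    parts = unicodedata.normalize("NFD", ch)
    if len(parts) == 2 and parts[0].isascii():
        return parts[0], unicodedata.name(parts[1])
    return None


def survey():
    """Split codes 0241-0377 into composable and non-composable lists."""
    easy, hard = [], []
    for code in range(0xA1, 0x100):
        (easy if composition(code) else hard).append(code)
    return easy, hard


if __name__ == "__main__":
    easy, hard = survey()
    print("%d composable, %d need a real glyph" % (len(easy), len(hard)))
    for code in hard:
        print("  %04o %s" % (code, unicodedata.name(chr(code))))
```

The decomposition data only knows about accents, so a character such
as the Yen sign, which could be overstruck by hand, still lands in the
second list along with the eth and the other symbols.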

staff@cadlab.sublink.ORG (Alex Martelli) (01/04/91)

npn@cbnewsl.att.com (nils-peter.nelson) writes:
	...
:Chris Lewis requests an additional field in the width
:tables to instruct troff how to manufacture the additional
:8859 characters that are not in ASCII. Some of them appear
	...
:My inclination is to support two and only two modes for
:"production": PostScript and nroff. If you want ISO 8859
:nroff, get an ISO 8859 terminal. The stuff about 7 bit
:shorthand for 8 bit characters was intended for debugging
:and interchange, not production. So far, no one has answered
:my previous question: will this direction meet the needs
:of the European market?

Speaking as a European whose language needs very few diacritical marks
on letters (just a few accented vowels), I'd say the PS-or-nroff
direction would NOT be quite satisfactory; Chris's proposal looks VERY
much more attractive.  How a Turk would feel about it, I can't say.

Regarding your previous question re Laserjet printers, I must say that
they and their clones appear to be VERY well situated in the Italian
market; probably MORE popular than PS printers, even including the ones
you obtain by tweaking a Laserjet, maybe because such tweaks don't
always work well.  Lack of Laserjet support, in other terms, WOULD be
a (minor, but perceptible) handicap in the Italian market - although I
understand your arguments regarding why and wherefore you only want to
support PostScript output for ISO 8859.

---
Alex Martelli - CAD.LAB s.p.a., v. Stalingrado 53, Bologna, Italia
Email: (work:) staff@cadlab.sublink.org, (home:) alex@am.sublink.org
Phone: (work:) ++39 (51) 371099, (home:) ++39 (51) 250434; 
Fax: ++39 (51) 366964 (work only), Fidonet: 332/401.3 (home only).

clewis@ecicrl.UUCP (Chris Lewis) (01/07/91)

In article <1991Jan3.151843.24109@cbnewsl.att.com> npn@cbnewsl.att.com (nils-peter.nelson) writes:
|Chris Lewis requests an additional field in the width
|tables to instruct troff how to manufacture the additional
|8859 characters that are not in ASCII. Some of them appear
|quite easy (e.g., the Yen sign looks like a Y with a line
|through it, or the lower case letters with accent marks)
|but others appear near-impossible. For example, all the
|upper case letters will obliterate diacriticals, and the
|Icelandic eth doesn't appear to have an obvious representation.
|Is it worth doing half the job? (I.e., should we try to
|implement those characters that can be done this way and
|forget the others? Do a bad job on the others?)

My suggestion was to permit the fourth field to be more than one
character.  You're quite right in that this seems a bit half-assed.
However, in psroff it's fairly necessary to permit slight adjustment
of some characters' placement and kludge up some additional characters.
(Eg: HPLJ box drawing characters don't precisely line up the
way troff expects them to).  Psroff, though, has a somewhat more
sophisticated scheme, in ditroff width table terms it looks like:

	char kern width <sequence> <xshift> <yshift> <scale>

Where <sequence> is a sequence of one or more bytes to emit for the
glyph (it can even be invocations of Postscript functions), x shift and
y shift are adjustment factors that are multiplied by the point size
and added to the X and Y coordinate when positioning the character, and
scale is a factor to apply to the point size. (eg: bullets are too small
in Postscript Roman).  They default to 0, 0 and 1.

In this way, box corners can be tuned etc.  It's not particularly necessary
with Postscript, but it certainly is with Laserjets.
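
As a sketch of how such an entry might be read and applied (not psroff
source): the code below assumes whitespace-separated fields in the
order shown above, a <sequence> with no embedded blanks, and backslash
escapes limited to three-digit octal and a doubled backslash.  The
names and the sample values are invented for the example.

```python
import re
from collections import namedtuple

Entry = namedtuple("Entry", "char kern width sequence xshift yshift scale")

# Backslash followed by an octal code up to 377, or by another backslash.
_ESCAPE = re.compile(r"\\([0-3][0-7]{2}|[0-7]{1,2}|\\)")


def decode_sequence(field):
    """Expand backslash escapes in the emit sequence and return bytes."""
    def repl(m):
        tok = m.group(1)
        return "\\" if tok == "\\" else chr(int(tok, 8))
    return _ESCAPE.sub(repl, field).encode("latin-1")


def parse_entry(line):
    """Parse 'char kern width sequence [xshift [yshift [scale]]]'."""
    f = line.split()
    xshift = float(f[4]) if len(f) > 4 else 0.0   # default 0
    yshift = float(f[5]) if len(f) > 5 else 0.0   # default 0
    scale = float(f[6]) if len(f) > 6 else 1.0    # default 1
    return Entry(f[0], int(f[1]), int(f[2]), decode_sequence(f[3]),
                 xshift, yshift, scale)


def place(entry, x, y, pointsize):
    """Shifts are multiplied by the point size and added to X and Y;
    scale multiplies the point size itself."""
    return (x + entry.xshift * pointsize,
            y + entry.yshift * pointsize,
            pointsize * entry.scale)


if __name__ == "__main__":
    bullet = parse_entry(r"bu 0 50 \267 0 -0.25 1.5")   # made-up values
    print(bullet.sequence, place(bullet, 100, 200, 10))
    copyright_ = parse_entry(r"co 0 80 O\010c")         # O backspace c
    print(copyright_.sequence, place(copyright_, 100, 200, 10))
```

An entry with only four fields behaves exactly like an ordinary width
table line, so existing tables would not need to change.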

|My inclination is to support two and only two modes for
|"production": PostScript and nroff. If you want ISO 8859
|nroff, get an ISO 8859 terminal. The stuff about 7 bit
|shorthand for 8 bit characters was intended for debugging
|and interchange, not production. So far, no one has answered
|my previous question: will this direction meet the needs
|of the European market?

Speaking unofficially (I'm on the ISO/CSA/Treasury/POSIX committee),
having ditroff accept 8-bit characters on input (also use a reasonable set
of \(sequences for those without the proper terminal), being able to successfully
search for them in the width tables, and emit the appropriate stuff seems
to be consistent (from the perspective of code, not tables) and sufficient
for Canada to be happy (Canadian Federal Govt. is pushing 8859-1 because of
French-English bilingualism requirements).

From the perspective of only supporting Postscript on troff with 8859-1,
do you have any comments on the suggestions I made?  At the very least,
the other troff filters should *not* disallow 8-bit, and permit extension
of the character set by the user when/if the appropriate fonts are available.
Including the proper tables (and a pointer to where fonts might be obtained)
would be even better if you can't compress the fonts.

I think it would be a drastic mistake to only support 8859-1 on Postscript.
It's not that hard for HPPCL.
-- 
Chris Lewis, Phone: (613) 832-0541
UUCP: uunet!utai!lsuc!ecicrl!clewis
Moderator of the Ferret Mailing List (ferret-request@eci386)
Psroff mailing list (psroff-request@eci386)

clewis@ecicrl.UUCP (Chris Lewis) (01/07/91)

In article <1041@ecicrl.UUCP> clewis@ecicrl.UUCP (Chris Lewis) (me) writes:

>I think it would be a drastic mistake to only support 8859-1 on Postscript.
>It's not that hard for HPPCL.

*ESPECIALLY* considering the new HP Laserjets (III's) have two font sets
built in that the printer can scale to any size.
-- 
Chris Lewis, Phone: (613) 832-0541
UUCP: uunet!utai!lsuc!ecicrl!clewis
Moderator of the Ferret Mailing List (ferret-request@eci386)
Psroff mailing list (psroff-request@eci386)

lee@sq.sq.com (Liam R. E. Quin) (01/08/91)

npn@cbnewsl.att.com (nils-peter.nelson) writes:

>Our plan would be to provide support for the European standard
>(ISO 8859-1) character set only with the PostScript postprocessor,
>dpost.  [...]
>So, my question is, have the Europeans settled on PostScript as
>a standard for printers, or is there something else we should
>be supporting?

Speaking as someone who until very recently worked in a UK Unix company,
I would say that the LaserJet is literally orders of magnitude more
widespread.  The high cost of PostScript printers, coupled with the high
mark-up involved in shipping to Europe, means that PostScript has not made
anything like the market penetration it seems to have achieved in North
America.  Small companies to whom we sold (sq)troff were more likely
to have LaserJets than LaserWriters, probably since the former could be
obtained for under (the equivalent of) US$3,000 in the UK, whilst
PostScript printers started at a little over US$6,000.

I do feel that it might be worth your while investing in a little Market
Research.  [But perhaps I shouldn't be giving clues to the competition :-)]

And, as Chris Lewis points out, there is no reason why you shouldn't at
least compress the HP fonts if there are so many.  The AF and AD sets used
to come with a reasonable Latin-1 (Roman 8) character set.  With a careful
font downloading scheme and a good driver, a LaserJet can easily out-perform
most PostScript printers for most common jobs.


Lee

-- 
Liam R. E. Quin,  lee@sq.com, SoftQuad Inc., Toronto, +1 (416) 963-8337

gisle@ifi.uio.no (Gisle Hannemyr) (01/15/91)

In article <1991Jan3.151843.24109@cbnewsl.att.com> npn@cbnewsl.att.com (nils-peter.nelson) writes:

>  Chris Lewis requests an additional field in the width
>  tables to instruct troff how to manufacture the additional
>  8859 characters that are not in ASCII. Some of them appear
>  quite easy (e.g., the Yen sign looks like a Y with a line
>  through it, or the lower case letters with accent marks)
>  but others appear near-impossible. For example, all the
>  upper case letters will obliterate diacriticals, and the
>  Icelandic eth doesn't appear to have an obvious representation.
>  Is it worth doing half the job? (I.e., should we try to
>  implement those characters that can be done this way and
>  forget the others? Do a bad job on the others?)
>   My inclination is to support two and only two modes for
>  "production": PostScript and nroff. If you want ISO 8859
>  nroff, get an ISO 8859 terminal. The stuff about 7 bit
>  shorthand for 8 bit characters was intended for debugging
>  and interchange, not production. So far, no one has answered
>  my previous question: will this direction meet the needs
>  of the European market?

First, when you say ISO 8859, do you actually mean the complete set of
ISO 8859 character sets, or just the ISO 8859/1? In any case, only the
latter is yet implemented in most PostScript printers.

One important source of information on how the European market views
character sets is the European Government OSI Profiles.
They are called things like GOSIP (UK), SOSIP (Sweden), NOSIP
(Norway) and EPHOS == European Procurement Handbook for Open Systems 
(European Community). The X/Open portability guide NLS also discusses
character sets.

I have spent some time studying these, and in brief, yes, they focus
on ISO 8859/1. And IMHO supporting ISO 8859/1 will meet the major
requirements for most of Western Europe.  Note, however, the following
three exceptions:

1) Slavic languages, and of course Cyrillic, as required in Eastern
   Europe, are not covered by ISO 8859/1.

2) Lappish -- a small minority language in the north of Finland, Sweden
   and Norway -- is not covered by ISO 8859/1 (but by ISO 8859/4).

3) A number of network communication protocols (most importantly X.400
   electronic mail) assume ISO 6937/2, not ISO 8859/1 (ISO 6937/2 is
   a superset of ISO 8859/1-4).

--
Disclaimer: The opinions expressed herein are not necessarily those of my
            employer, not necessarily mine, and probably not necessary.
 
- gisle hannemyr  (Norwegian Computing Center)
  EAN:   C=no;PRMD=uninett;O=nr;S=Hannemyr;G=Gisle (X.400 SA format)
         gisle.hannemyr@nr.no                      (RFC-822  format)
  Inet:  gisle@ifi.uio.no
  UUCP:  ...!mcsun!ifi!gisle
------------------------------------------------