[comp.lang.postscript] FPU for PostScript

rcd@ico.isc.com (Dick Dunn) (01/09/90)

woody@rpp386.cactus.org (Woodrow Baker) writes:
> baffico@adobe.COM (Tom Baffico) writes:
> >...As to the benefit of having a FPU, for most controllers the performance
> > increase is actually quite small.
> 
As to the benefits of an FPU, I am certain that the speedup would matter.

I tend to agree with Tom, and while I haven't investigated it, I assume
that Adobe has.  Intuitively, it seems reasonable; I'd think the down'n'
dirty work of the font rendering would be mostly fixed point.  Probably a
lot of it is not numerical work at all.  Still, as I suggest, I ASSume that
Adobe knows what they're doing.

Woody - please submit some evidence that an FPU would make some useful
speedup in a PostScript controller.  I don't mean your conjecture; I mean
some real evidence, or solid reasoning.  Since, once again, you're trying
to tell Adobe that they don't know their business, the burden of proof is
on you.
-- 
Dick Dunn     rcd@ico.isc.com    uucp: {ncar,nbires}!ico!rcd     (303)449-2870
   ...Mr. Natural says, "Use the right tool for the job."

mcdonald@aries.scs.uiuc.edu (Doug McDonald) (01/10/90)

In article <1990Jan9.044252.18617@ico.isc.com> rcd@ico.isc.com (Dick Dunn) writes:
>woody@rpp386.cactus.org (Woodrow Baker) writes:
>> 
>> As to the benefits of an FPU, I am certain that the speedup would matter.
>
>I tend to agree with Tom, and while I haven't investigated it, I assume
>that Adobe has.  Intuitively, it seems reasonable; I'd think the down'n'
>dirty work of the font rendering would be mostly fixed point.  Probably a
>lot of it is not numerical work at all.  Still, as I suggest, I ASSume that
>Adobe knows what they're doing.
>
>Woody - please submit some evidence that an FPU would make some useful
>speedup in a PostScript controller.  I don't mean your conjecture; I mean
>some real evidence, or solid reasoning.  Since, once again, you're trying
>to tell Adobe that they don't know their business, the burden of proof is
>on you.

Adobe's business is BUSINESS - i.e. profits, as is that of printer
makers. Whether an FPU is worthwhile depends on how much more money
it will make for them, and a faster printer might not sell enough more
units to justify the cost. It's complicated.

ken@cs.rochester.edu (Ken Yap) (01/10/90)

|Adobe's business is BUSINESS - i.e. profits, as is that of printer
|makers. Whether an FPU is worthwhile depends on how much more money
|it will make for them, and a faster printer might not sell enough more
|units to justify the cost. It's complicated.

Why don't we base this discussion on something more substantial?  I'm
reposting an article I saved.  Hope this leads to better discussion.

From: adobe!taft@decwrl.DEC.COM (Ed Taft)
Date: 18 Feb 1986 1750-PST (Tuesday)
Subject: Hardware support for PostScript

Several people have suggested hardware enhancements (e.g., faster CPUs,
RasterOp chips, etc.) to improve the performance of PostScript
printers. Naturally, this is a topic of great interest to us at Adobe.
I'd like to share a few of our current thoughts with you. Please note
that I am talking only about current products; I am not speculating
about future ones.

Adobe's approach to PostScript has been first to define a fully general
software model for the programming language and page description
capabilities and only then to consider how hardware can be employed to
accelerate the software. Experience with a pure software implementation
of PostScript (of which the LaserWriter is a good example) gives us an
understanding of what parts of the implementation would benefit most
from hardware support.

There are three major activities that together account for most of the
execution time in Adobe's implementation of PostScript. These are:

  (1) Low-level raster manipulations, principally painting character
bitmaps and filling trapezoids located at arbitrary bit boundaries.
For typical pages, this activity dominates everything else if all
characters are already in the font cache.

  (2) Character scan conversion. This is a very compute intensive
operation because the original character definitions are at a high
level and are being pushed through the full PostScript graphics
machinery. In particular, there is a lot of arithmetic, both fixed and
floating point.

  (3) PostScript input scanning and interpretation. This includes
parsing the input stream, constructing tokens, looking up names,
pushing and popping stacks, etc. The amount of time consumed by this
activity varies considerably according to the type of page description
and the programming style. A text document that consists primarily of
strings and calls to simple PostScript procedures consumes relatively
little time in the interpreter; a document that executes a lot of
PostScript code for each mark placed on the page consumes
proportionately more.

Of course, I have deliberately left out time spent waiting for input data or
waiting for the print engine. The effect of a slow communication channel or
a slow print engine can completely dominate everything else. More to the
point, obtaining the best performance requires the ability to perform
communication, execution, and printing activities in parallel.

The above three activities benefit from significantly different kinds of
hardware support. (Of course, in a strictly software implementation, a
faster CPU should speed all three activities.) Considering them in order:

  (1) Simple hardware for shifting and masking makes a substantial
difference here; the full generality of RasterOp is not needed. The idea is
to minimize the number of CPU instructions and memory cycles needed to
perform simple, repetitive bit moving operations. A shifter-masker is
included in the Adobe Redstone controller, versions of which are used in all
present PostScript printers except the LaserWriter. This activity is one
that would benefit greatly from having a separate, parallel processor; its
interface with the rest of PostScript would be quite simple.

  (2) Efficient arithmetic is of particular importance here. Also, since a
vast amount of code is being executed and all of it is written in a
high-level language (C in the case of Adobe's implementation), the overall
quality of compiled code is important. Apart from arithmetic, no single
component dominates, so it's not practical to assembly-code much of it.

  (3) Here is a place where some special hardware and/or microcode might
help. The PostScript interpreter's data structures and algorithms are
sufficiently straightforward that custom hardware may be practical. Whether
or not this makes sense economically depends on how much time is spent in
the interpreter relative to everything else, which, as I said, is highly
application dependent.

amanda@mermaid.intercon.com (Amanda Walker) (01/10/90)

In article <1990Jan9.182332.8554@cs.rochester.edu>, ken@cs.rochester.edu (Ken
Yap) writes:
> Why don't we base this discussion on something more substantial? I
> repost an article I saved.  Hope this leads to better discussion.

I knew this discussion was getting familiar.  Thanks!

> Experience with a pure software implementation
> of PostScript (of which the LaserWriter is a good example) gives us an
> understanding of what parts of the implementation would benefit most
> from hardware support.

Another thing that I would imagine it's good for is that by running
the implementation on something like a UNIX box, you can profile it
and actually look at where the time goes.  This is critical for finding
out what will actually make the most difference when you speed it up.

>   (1) Low-level raster manipulations, principally painting character
> bitmaps and filling trapezoids located at arbitrary bit boundaries.
> For typical pages, this activity dominates everything else if all
> characters are already in the font cache.

This sounds like a good candidate for hardware.  The experience of some
of the PostScript clone controllers seems to show that a TI 34010 graphics
processor or an AMD 29000 can significantly improve the interpreter's
ability to lug bits around.
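
To make "lugging bits around" concrete, here's a minimal sketch of the
kind of shift-and-mask inner loop involved.  The names and the 32-bit-word
assumption are mine; real blit code also has to handle clipping, multi-word
source rows, and the other raster ops:

    /* OR one row of a cached character bitmap (up to 32 pixels,
     * left-justified in 'src') into the frame buffer at an arbitrary
     * destination bit offset.  Assumes 1 <= nbits <= 32, MSB-first
     * pixel order. */
    typedef unsigned long word;              /* 32 bits on a 680x0 */
    #define BITS 32

    void paint_row(word *dst, unsigned long dstbit, word src, int nbits)
    {
        word mask = ~(word)0 << (BITS - nbits);  /* keep top nbits */
        int skew = (int)(dstbit % BITS);         /* offset in word */

        dst += dstbit / BITS;
        dst[0] |= (src & mask) >> skew;          /* first dst word */
        if (skew && skew + nbits > BITS)         /* spills over    */
            dst[1] |= (src & mask) << (BITS - skew);
    }

A dedicated shifter-masker wins precisely because the shifts and the
masked read-modify-write cycles are nearly all this loop does.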

>   (2) Character scan conversion. This is a very compute intensive
> operation because the original character definitions are at a high
> level and are being pushed through the full PostScript graphics
> machinery. In particular, there is a lot of arithmetic, both fixed and
> floating point.

This looks like the least tractable as far as throwing off-the-shelf
hardware at it is concerned.  A good FPU (like a 68881 or 68882), perhaps
with some hand-tuned code for multiplying things by the CTM, would be my
guess at the best approach.
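
For reference, the per-point arithmetic is small but relentless.  A sketch
of the transform in question (my code, not Adobe's; the CTM is the usual
six-element PostScript matrix [a b c d tx ty]):

    /* Map a user-space point to device space through the CTM:
     * x' = a*x + c*y + tx, y' = b*x + d*y + ty.  Four multiplies
     * and four adds per point -- exactly the kind of kernel a
     * 68881/68882 speeds up directly. */
    typedef struct { double a, b, c, d, tx, ty; } Matrix;

    void xform(const Matrix *m, double x, double y,
               double *xd, double *yd)
    {
        *xd = m->a * x + m->c * y + m->tx;
        *yd = m->b * x + m->d * y + m->ty;
    }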

>   (3) [...] A text document that consists primarily of
> strings and calls to simple PostScript procedures consumes relatively
> little time in the interpreter; a document that executes a lot of
> PostScript code for each mark placed on the page consumes
> proportionately more.

This seems to indicate that the major avenue of improvement is not the
tokenizer but the code that runs through executable arrays.  Now, from the
outside these look a lot like pretty conventional token-threaded code.
A change to subroutine-threaded code could make a significant improvement
in execution speed, although it might take more space and could introduce
compatibility problems with code that does 'put's into executable arrays.
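
In case the distinction isn't familiar, here is a toy illustration of the
two styles (hypothetical structures, nothing like Adobe's actual ones):

    #include <stdio.h>

    enum { T_OP, T_INT };
    typedef struct { int tag; void (*op)(void); int val; } Obj;

    static void moveto_op(void) { printf("moveto\n"); }
    static void show_op(void)   { printf("show\n"); }

    /* Token-threaded: one tag test per object, every execution. */
    void exec_tokens(const Obj *a, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            if (a[i].tag == T_OP)
                a[i].op();
            else
                printf("push %d\n", a[i].val);
    }

    /* Subroutine-threaded: the array has been "compiled" into a flat
     * list of calls; no tag tests at run time.  If someone 'put's
     * into the source array, this list must be rebuilt -- the
     * compatibility problem mentioned above. */
    void exec_subrs(void (*const *subr)(void), int n)
    {
        int i;
        for (i = 0; i < n; i++)
            subr[i]();
    }

    int main(void)
    {
        Obj prog[] = { { T_INT, 0, 72 },
                       { T_OP, moveto_op, 0 },
                       { T_OP, show_op, 0 } };
        void (*const subrs[])(void) = { moveto_op, show_op };

        exec_tokens(prog, 3);
        exec_subrs(subrs, 2);
        return 0;
    }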

> More to the
> point, obtaining the best performance requires the ability to perform
> communication, execution, and printing activities in parallel.

Yes.  One thing I noticed from using a Dataproducts LZR-2665 was that even
though its controller didn't seem too much faster than a LaserWriter at
imaging, the aggregate throughput was often much higher since it has two
page buffers, and could be imaging a page while the previous one was still
being sent out to the marking engine.  Hardware buffering for serial ports
(and maybe even a second processor for AppleTalk or Ethernet) would also
reduce this kind of waiting.  There's nothing like having a Linotron 300
ignore its input stream for 10 seconds per page to make you appreciate an
I/O processor :-).
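
The two-buffer trick is simple enough to sketch (hypothetical engine
interface, nothing Dataproducts-specific; the engine calls are assumed,
not real):

    /* Ping-pong page buffers: ship page N to the marking engine
     * while imaging page N+1 into the other buffer. */
    typedef struct { unsigned char *bits; } PageBuf;

    void engine_start(PageBuf *p);     /* begin shipping p (async) */
    void engine_wait(void);            /* block until ship done    */
    int  image_next_page(PageBuf *p);  /* render a page; 0 at EOF  */

    void print_job(PageBuf *a, PageBuf *b)
    {
        PageBuf *cur = a, *nxt = b, *tmp;
        int more;

        if (!image_next_page(cur))
            return;
        for (;;) {
            engine_start(cur);            /* page N goes out...    */
            more = image_next_page(nxt);  /* ...while N+1 images   */
            engine_wait();
            if (!more)
                break;
            tmp = cur; cur = nxt; nxt = tmp;
        }
    }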

> A shifter-masker is
> included in the Adobe Redstone controller, versions of which are used in all
> present PostScript printers except the LaserWriter. This activity is one
> that would benefit greatly from having a separate, parallel processor; its
> interface with the rest of PostScript would be quite simple.

Well, aside from the fact that this is a little dated, advances in the 68000
family can also provide some of these benefits.  The 68020 and 68030, for
example, are much better at shifting and doing bit field operations than the
68000 was.

Amanda Walker
Speaker To PostScript
InterCon Systems Corporation
--

woody@rpp386.cactus.org (Woodrow Baker) (01/10/90)

In article <1990Jan9.044252.18617@ico.isc.com>, rcd@ico.isc.com (Dick Dunn) writes:
> 
> I tend to agree with Tom, and while I haven't investigated it, I assume
> that Adobe has.  Intuitively, it seems reasonable; I'd think the down'n'
> dirty work of the font rendering would be mostly fixed point.  Probably a
> lot of it is not numerical work at all.  Still, as I suggest, I ASSume that
> Adobe knows what they're doing.
See the repost of the Feb article from Adobe.  Font outlines do have a mix
of FP and fixed point.  I don't have a 68881 FPU manual handy, but I'm
sure that the hardware is MUCH faster than software. 'nuff said.
NOT a conjecture.

Cheers
Woody

> Dick Dunn     rcd@ico.isc.com    uucp: {ncar,nbires}!ico!rcd     (303)449-2870
>    ...Mr. Natural says, "Use the right tool for the job."

mikep@crackle.amd.com (Mike Parker) (01/11/90)

I'll just remove all attributions for fear of getting them all wrong...
| 
| > Experience with a pure software implementation
| > of PostScript (of which the LaserWriter is a good example) gives us an
| > understanding of what parts of the implementation would benefit most
| > from hardware support.
| 
| Another thing that I would imagine it's good for is that by running
| the implementation on something like a UNIX box, you can profile it
| and actually look at where the time goes.  This is critical for finding
| out what will actually make the most difference when you speed it up.

There are far too many other first order effects.  I've spent a lot of
time trying to get a handle on where PS spends its processing time and
I do have some hard data (was it Dick Dunn that wanted numbers?).  First,
the processor makes a big difference.  I work for AMD so I really understand
the Am29000 much better than others, but it is clear that an external
shifter might not help the Am29000 as much as, say, the 68000.  Case in
point: there are a lot of bit-blt accelerators available for the 68000
(like the Cirrus chip), but our bit-blt code for the Am29000 is completely
memory bound; the only external hardware that would help is a faster
memory system.

The memory system is another key factor.  One example: on one particular
board, the Am29000 running the Phoenix clone with the Am29027 FPU is 46x a
LaserWriter Plus, while without the FPU it is 30x the Plus.  But it would
be all wrong to say that the FPU gives a 50% boost to performance, because
we have other boards where the boost is much larger and others where it is
much smaller.

Choice of software is also a key contributor.  Another clone, Pipeline
Associates, goes from 5.9x the NTX without the FPU to 10.2x the NTX with the
FPU on the same board as the previous Phoenix numbers.  So it would appear that
Pipeline is more FP dependent than Phoenix.  I'm told by people who probably 
do not know that Adobe is very FP independent, so maybe they'd see less of
a hit.  Real soon now I'll be able to quote similar numbers for Bauer/uSoft.

Further evidence that the problem is SW-vendor dependent is that the
Pipeline people worked long and hard to improve performance both with and
without the FPU, and were able to make very large differences and to
close the gap significantly.  In particular, we found that basic add,
sub, mul, div were not nearly the culprits that a certain few transcendentals
were.  Hand coded transcendental routines from Pipeline made huge
performance differences in the non-FPU case for some files.  Pipeline
already had the advantage of a pure integer font rendering mechanism
(Nimbus-Q); they changed their Bezier solver to pure integer as well,
along with a few other key routines.  It was a lot of work, and many less
caring clone vendors haven't done the exercise.  Adobe being older and
bigger, it stands to reason that they have worked pretty hard on this.
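
To give a feel for what "pure integer" means here, this is my own sketch
(not Pipeline's code) of Bezier subdivision in 16.16 fixed point --
midpoint (de Casteljau) splitting needs only adds and shifts:

    typedef long fix;                   /* 16.16 fixed point        */
    /* average: add and shift; assumes arithmetic >> on negatives,
     * true on the CPUs under discussion */
    #define MID(a,b) (((a) + (b)) >> 1)

    typedef struct { fix x[4], y[4]; } Cubic;

    /* Split c at t = 1/2 into left and right halves.  Recurse until
     * each half is flat enough, then emit line segments. */
    void split(const Cubic *c, Cubic *l, Cubic *r)
    {
        fix ab_x  = MID(c->x[0], c->x[1]), ab_y  = MID(c->y[0], c->y[1]);
        fix bc_x  = MID(c->x[1], c->x[2]), bc_y  = MID(c->y[1], c->y[2]);
        fix cd_x  = MID(c->x[2], c->x[3]), cd_y  = MID(c->y[2], c->y[3]);
        fix abc_x = MID(ab_x, bc_x),       abc_y = MID(ab_y, bc_y);
        fix bcd_x = MID(bc_x, cd_x),       bcd_y = MID(bc_y, cd_y);
        fix m_x   = MID(abc_x, bcd_x),     m_y   = MID(abc_y, bcd_y);

        l->x[0] = c->x[0]; l->x[1] = ab_x;  l->x[2] = abc_x; l->x[3] = m_x;
        l->y[0] = c->y[0]; l->y[1] = ab_y;  l->y[2] = abc_y; l->y[3] = m_y;
        r->x[0] = m_x;     r->x[1] = bcd_x; r->x[2] = cd_x;  r->x[3] = c->x[3];
        r->y[0] = m_y;     r->y[1] = bcd_y; r->y[2] = cd_y;  r->y[3] = c->y[3];
    }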

So profiling on a LaserWriter, or worse yet on a UNIX box whose memory
system may be very unlike a printer's, isn't really going to give data
applicable to PostScript printers as a whole.
| 
| >   (1) Low-level raster manipulations, principally painting character
| > bitmaps and filling trapezoids located at arbitrary bit boundaries.
| > For typical pages, this activity dominates everything else if all
| > characters are already in the font cache.
| 
| This sounds like a good candidate for hardware.  The experience of some
| of the PostScript clone controllers seems to show that a TI 34010 graphics
| processor or an AMD 29000 can significantly improve the interpreter's
| ability to lug bits around.

Thanks for the plug (all others flame me when the advertising content
exceeds the hard data).  I'm not so sure that the low-level raster
stuff dominates.  I've been told that the split is nearly 50/50.  I tend to
agree with the earlier poster who said that it varies greatly for different 
pages.  But I have hard evidence that it isn't all that low in the
case cited where the page is all text and all hits in font cache.  We have
a 9 page pure text document that we have run on all sorts of configurations.
I believe that both Pipeline and Phoenix use the same blit code (supplied
by AMD) and yet they get very different results.  The first few pages show
large differences due to differences in character rendering time
for font cache misses, and you can see the time per page curve down
exponentially to an asymptote at about the fourth page.  At the asymptote,
the Pipeline code runs at roughly 0.5 seconds per page on the same exact
hardware where the Phoenix code runs at about 0.75 seconds per page.  I
can't see where any of the difference is anything but "interpretation"
(as opposed to raster file manipulation).

I have a plan and would like some input on its validity.  We'll take the
same exact hardware except we'll change the serializer crystal so we can
run at 400 dpi and we'll tell the code to run at 400 dpi.  We'll run
a variety of pages at both resolutions.  It seems like some simple algebra
will then give us the interpretation/raster split.  We'll have 78% more
pixels (400^2/300^2 = 16/9), so if we take 20% longer to run a file, then
raster processing time must be 20/78, or about 26%, of the total task.  If
enough of you say that the
experiment is valid, I'll run it, otherwise I'll run it and just not tell
anybody.
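
For what it's worth, here is the algebra as code.  The timings below are
made up, and I'm assuming the stock engine is 300 dpi, so 400 dpi means
(400/300)^2 = 1.78x the pixels:

    #include <stdio.h>

    int main(void)
    {
        double k  = (400.0 * 400.0) / (300.0 * 300.0); /* pixel ratio  */
        double t1 = 10.0;  /* seconds for a file at 300 dpi (example)  */
        double t2 = 12.0;  /* same file at 400 dpi: 20% longer         */

        /* t1 = R + I, t2 = k*R + I  =>  R = (t2 - t1) / (k - 1) */
        double r = (t2 - t1) / (k - 1.0);
        printf("raster fraction at 300 dpi: %.0f%%\n", 100.0 * r / t1);
        return 0;
    }

The one assumption worth flagging is that raster time scales linearly
with pixel count while everything else stays fixed; font cache behavior
at 400 dpi could bend that.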


Please blame all gross spelling errors on a noisy line...

mikep
Mike Parker
mikep@amdcad.AMD.COM