[comp.windows.x] The easiest way to speed up X11R4

name@portia.Stanford.EDU (tony cooper) (02/08/90)

Suppose I wanted to spend, say, a day making X11R4 faster for my
particular hardware. What would be the best way to spend my time?
I am interested in 8 bit color speedups at the expense of portability.

For example:

Is there a piece of code that is executed every time the server
does something? I could convert it to assembly language.

Is there some device-independent code that could be converted to
device-dependent code?

The README for the cfb directory says the cfb code is "very slow".
Is the cfb directory a good place to start? Which are the 20 most
time-critical lines?

Which are the 20 most time-critical lines in the whole of X11R4?

I have done this sort of thing before with spectacular results. I
once sped up a program by an order of magnitude by changing one line
of C code inside a loop into assembly language and using registers.
It seems to me that X11 has a lot of potential for this kind of
speedup, since it is so device independent and since all graphics
operations boil down to simply turning bits on or off.
There must be some code somewhere that does the nitty gritty bitty
stuff at a low level that I can get at.

For those interested in receiving a copy, I'll be doing this for
the MC68030. Even more specifically, it will be for the Macintosh II.

Tony Cooper
tony@popserver.stanford.edu

keith@EXPO.LCS.MIT.EDU (Keith Packard) (02/09/90)

> Suppose I wanted to spend, say, a day making X11R4 faster for my
> particular hardware. What would be the best way to spend my time?
> I am interested in 8 bit color speedups at the expense of portability.

As usual, this depends almost completely on what you will be using the
server for.  The R4 server is actually pretty good at a wide range of
common tasks.

> The README for the cfb directory says the cfb code is "very slow".

The README file was not updated for R4.  Look at the CHANGES file and
you'll see a more promising comment:

"This directory now provides a real implementation for 8-bit frame buffers,
 driving the frame buffer at memory bandwidth for many operations"

The most heavily tuned operations are BitBlt, text painting, line drawing
and rectangle filling.  For these operations, you'd be hard pressed to
get much performance increase even coding them in assembly.
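One common reason fills like these can run at memory bandwidth is that the 8-bit pixel is replicated across a whole machine word, so each store writes four pixels instead of one. The sketch below only illustrates that general idea; it is not the cfb code, the names `replicate_pixel` and `fill_rect8` are hypothetical, and it assumes a linear frame buffer with one byte per pixel and rows `stride` bytes apart.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Replicate an 8-bit pixel value into all four bytes of a 32-bit word.
   Byte order does not matter here because all four bytes are equal. */
static uint32_t replicate_pixel(uint8_t pixel)
{
    uint32_t w = pixel;
    w |= w << 8;   /* 0x000000pp -> 0x0000pppp */
    w |= w << 16;  /* 0x0000pppp -> 0xpppppppp */
    return w;
}

/* Hypothetical fill: set a width x height rectangle at (x, y) to `pixel`. */
static void fill_rect8(uint8_t *fb, size_t stride, size_t x, size_t y,
                       size_t width, size_t height, uint8_t pixel)
{
    const uint32_t word = replicate_pixel(pixel);

    for (size_t row = 0; row < height; row++) {
        uint8_t *p = fb + (y + row) * stride + x;
        size_t n = width;

        /* Leading pixels, one byte at a time, until p is 4-byte aligned. */
        while (n > 0 && ((uintptr_t)p & 3u) != 0) {
            *p++ = pixel;
            n--;
        }
        /* Aligned middle: one 32-bit store writes four pixels. memcpy keeps
           this legal C; compilers typically turn it into a single store. */
        while (n >= 4) {
            memcpy(p, &word, sizeof word);
            p += 4;
            n -= 4;
        }
        /* Trailing pixels that do not fill a whole word. */
        while (n > 0) {
            *p++ = pixel;
            n--;
        }
    }
}
```

On a frame buffer where each access costs the same regardless of width, cutting the number of stores by four is where most of the gain comes from. Hand-coding this loop in assembly mostly saves loop overhead, which matches the point above that little is left to gain.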

> For those interested in receiving a copy, I'll be doing this for
> the MC68030.

The R4 server has code which is tuned for the 68020 family; it allows you
to specify a few machine characteristics which guide the compiler to the
correct bits of code.

> Even more specifically, it will be for the Macintosh II.

This is the bad news.  The Mac II frame buffer cards which sit on the NuBus
have a memory latency of ~1us per access and no special block-mode
optimizations, which gives them a bandwidth of only 4MB/sec (4 bytes/access).
Nothing the R4 server does can help this out; you end up with a server which
runs 2.5 times slower than a Sun 3/60.
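The bandwidth figure follows directly from the latency: 4 bytes every 1us is 4MB/sec. The small C sketch below works through that arithmetic and, assuming for illustration a 640x480 screen at 8 bits per pixel (a screen size not stated above), how long it takes just to write every pixel once. The function names are hypothetical.

```c
/* Bandwidth of a bus that moves `bytes_per_access` bytes in
   `seconds_per_access` seconds, with no block-mode transfers. */
static double bandwidth_bytes_per_sec(double bytes_per_access,
                                      double seconds_per_access)
{
    return bytes_per_access / seconds_per_access;
}

/* Time to write every pixel of an 8-bit (1 byte/pixel) screen once. */
static double full_screen_write_seconds(int width, int height,
                                        double bytes_per_sec)
{
    return (double)width * (double)height / bytes_per_sec;
}
```

With about 1us per 4-byte access, `bandwidth_bytes_per_sec(4.0, 1e-6)` gives 4,000,000 bytes/sec. Filling a 640x480 screen (307,200 bytes) therefore takes about 77 ms however tight the server's inner loop is, because the bus, not the CPU, is the limit.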

Some of the newer Macintoshes have on-board frame buffers.  I expect that
eliminating the NuBus would give them more respectable frame buffer latency
numbers, but I haven't ever had a chance to benchmark them running our code.

If you want to start tuning the R4 code, get a copy of x11perf and start
measuring.  As usual, any changes you make would be welcome at MIT if directed
at 'xbugs@expo.lcs.mit.edu'.
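x11perf measures requests through the whole server. When you want to try a single replacement inner loop in isolation first, a tiny standalone harness can be handy. The sketch below is a complement to x11perf, not part of it. It assumes that `clock()` CPU time is an adequate measure for a tight loop, and the names `measure_fill_rate`, `fill_bytes` and `fill_memset` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* A candidate routine that writes `len` bytes of `pixel` into `buf`. */
typedef void (*fill_fn)(uint8_t *buf, size_t len, uint8_t pixel);

/* Call `fill` repeatedly until at least `min_seconds` (> 0) of CPU time
   have passed; return the achieved rate in bytes per second. */
static double measure_fill_rate(fill_fn fill, uint8_t *buf, size_t len,
                                double min_seconds)
{
    clock_t start = clock();
    double elapsed;
    unsigned long iterations = 0;

    do {
        fill(buf, len, (uint8_t)iterations);
        iterations++;
        elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
    } while (elapsed < min_seconds);

    return (double)iterations * (double)len / elapsed;
}

/* Two candidates to compare: a plain byte loop and the C library's memset. */
static void fill_bytes(uint8_t *buf, size_t len, uint8_t pixel)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = pixel;
}

static void fill_memset(uint8_t *buf, size_t len, uint8_t pixel)
{
    memset(buf, pixel, len);
}
```

Run both candidates on the target machine and compare the rates. An optimizing compiler may inline or restructure these loops, so treat the numbers as a rough guide and confirm any win with x11perf in the real server.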

Keith Packard
MIT X Consortium