[comp.windows.x] Purdue speedups to X11 Release 3 mfb code

spaf@cs.purdue.EDU (Gene Spafford) (11/18/88)

I have just sent to comp.sources.x my patches to X11 Release 3 to
speed up the mfb (monochrome) server performance.  These patches may also
be ftp'd from mordred.cs.purdue.edu (file: ~ftp/pub/X11/Purdue-speedups.mfb.Z)
and expo.lcs.mit.edu (file: ~ftp/contrib/Purdue-speedups.mfb.Z).  I hope they
will be available on some other systems soon.

I hope to do a similar set of optimizations for the color (cfb) code sometime
in the next few weeks.

Enclosed is the README file from the distribution.


Intro
-----
These are the Purdue speedups for X11, Release 3.  These apply only to
B/W servers (for the most part); similar patches for the color code
should be released sometime in the next few weeks.

Installation
------------
The patches in this archive should all be applied to the files in the
server/ddx/mfb directory.  You need to set the symbol PURDUE in your
macros or site.def file (e.g.,
"#define OptimizedCDebugFlags -O -DPURDUE") to use them.
Alternatively, you can patch your server/ddx/mfb/Imakefile as follows:

*** server/ddx/mfb/Imakefile.orig	Thu Nov 17 15:52:45 1988
--- server/ddx/mfb/Imakefile		Thu Nov 17 15:52:45 1988
***************
*** 19,24 ****
--- 19,25 ----
  	 mfbpawhite.o mfbpablack.o mfbpainv.o mfbtile.o \
           mfbtewhite.o mfbteblack.o mfbmisc.o mfbbstore.o
  
+ DEFINES = -DPURDUE
  STD_DEFINES = ServerDefines
  CDEBUGFLAGS = ServerCDebugFlags
  INCLUDES = -I. -I../../include -I$(INCLUDESRC)

Similar patches must be made to ddx/mi/Imakefile and ddx/cfb/Imakefile
since ddx/mfb/maskbits.h is included in files in those directories.

Whichever change you make, you will need to cd to the server directory, then:
make Makefile; make Makefiles depend; make


Description
-----------
The changes in these patches fall into a few similar categories:
    * Optimized or added bitmasking functions, taking advantage of
      properties known to exist for certain arithmetic operators
      and domains of input;
    * Replacement of calculated bitmasks with table lookups;
    * Use of Duff's device in some places where it looks beneficial;
    * Reordering of code to share variables or move invariants out of
      loops.
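
To illustrate the table-lookup category, here is a minimal sketch of a
mask table.  The names start_mask, init_start_masks and merge_from are
hypothetical, and the table is filled at startup rather than written
out as constants.  The sketch assumes 32-bit words with pixel 0 in the
most significant bit; it is not the code from the patches.

```c
#include <assert.h>
#include <stdint.h>

#define WORD_BITS 32

/* Hypothetical lookup table: start_mask[x] has pixels x..31 set,
 * with pixel 0 stored in the most significant bit. */
uint32_t start_mask[WORD_BITS];

/* Fill the table once, so drawing loops never recompute a mask. */
void init_start_masks(void)
{
    unsigned x;

    for (x = 0; x < WORD_BITS; x++)
        start_mask[x] = 0xFFFFFFFFu >> x;
}

/* Keep the pixels of dst left of x and take pixels x..31 from src. */
uint32_t merge_from(uint32_t dst, uint32_t src, unsigned x)
{
    uint32_t m;

    assert(x < WORD_BITS);
    m = start_mask[x];          /* table lookup instead of a shift */
    return (dst & ~m) | (src & m);
}
```

Call init_start_masks() once before the first merge_from().  Whether the
lookup actually beats the shift depends on the cpu and compiler, which
is why measuring on your own machine matters.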

The changes seem to make some significant (but sometimes difficult to
measure objectively) impact on the speed of most operations.  This
speedup will differ based on your job mix and machine configuration.
Some operations appear to take up to 35% less cpu time to complete.
Incremental measurements with gprof, time, and other tools show each
change to have a positive overall effect on the server efficiency.  In
particular, painting windows and drawing lines appear to be much
faster.  An "ico -r" is obviously faster and smoother, as is tiling the
root window (on my Sun 3/60).  Note that I have not done any formal
benchmarking, so the patches might still benefit from some tuning.  In
particular, making the "Duff" macro unroll more or fewer items might be
beneficial.  On the MacII, for instance, a Duff's Device unrolled only
4 times is better than the 8 used in the patches enclosed.
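
For readers who have not met Duff's device, here is a minimal sketch of
an 8-way unrolled word fill.  The function name fill_words_duff8 is
hypothetical; this is not the "Duff" macro from the patches, only the
general technique it is named after.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store `value` into `count` consecutive 32-bit words, unrolled 8 ways.
 * The switch jumps into the middle of the loop body to handle the
 * leftover count % 8 words on the first pass. */
void fill_words_duff8(uint32_t *dst, size_t count, uint32_t value)
{
    size_t passes;

    if (count == 0)
        return;
    assert(dst != NULL);

    passes = (count + 7) / 8;   /* number of trips through the loop */
    switch (count % 8) {
    case 0: do { *dst++ = value;
    case 7:      *dst++ = value;
    case 6:      *dst++ = value;
    case 5:      *dst++ = value;
    case 4:      *dst++ = value;
    case 3:      *dst++ = value;
    case 2:      *dst++ = value;
    case 1:      *dst++ = value;
            } while (--passes > 0);
    }
}
```

A 4-way version would drop cases 7 through 4 and use (count + 3) / 4 and
count % 4 instead.  Which factor wins depends on the machine, as the
MacII result above suggests.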

Interestingly enough, the binary after installing these patches also
seems *smaller*.

The changes have been generated in a machine-independent way and should
work on any other machine, although I have not yet been able to try
them on any other kind of cpu.  I have thoroughly tested them on
various Sun 3 machines, and all my changes have been to optimize for
that architecture (68020).  If I blew it, or if you have more
portable/better versions of these changes, please share them with me
and I'll use them in further releases.

[I think this code is used in the Apollo servers, and since they also
 use a 68xxx chip, these changes should work there too.  They have been
 tested on a Mac II and work there (and act as speedups). I would love
 to see how these work on a Vaxstation, but DEC wouldn't even sell one
 to me at retail!  If anyone at DEC can explain what DEC has against
 us/me here at Purdue, I'd love to talk with you. We're all mystified. ]

Future Work
-----------
Some optimizations could still be done on this code.  Other than
changes to the cfb code, these include:
    * putting in some hacks to enhance speed for certain compilers.
      In particular, gcc can produce incredibly good code, and some
      small tuning of the mfb/cfb code could result in significant
      improvements.
    * putting in custom assembler macros for commonly used instructions,
      such as the bits routines, abs, round, etc.
    * algorithmic changes that radically change the nature of how
      some things are done in the server; this amounts to a rewrite
      of portions of the server.

I may get around to doing some of these in the next few months.  If
not, I hope others will and then share their results with the rest of
us.

Please share your comments on this package with me -- I'd like to know
what else *we* can do to make this server a more efficient piece of
code.

Thanks to:
----------
Sam Kimery of Purdue ECN helped me develop some of the
optimizations in the first release of these fixes (for X11R2).  Terry
Donahue of Project Athena contributed some server fixes with the X11R3
release that helped focus my attention on some sections of code,
although I did not use any of his changes in these patches.  The
Purdue/Florida Software Engineering Research Center provided the
machines and funding that allowed me to do this tinkering.
Thanks to Jim Fulton for testing the changes on a Mac II, and for
recommending some format changes (including the "SmallDuff").

Gene Spafford
spaf@cs.purdue.edu		11/17/88



PS.  Late-breaking numbers from someone (who wishes to remain
anonymous) who has applied these changes to the X11.R3 code on a Vax
with an Xqvss display:

>Date: Thu, 17 Nov 88 15:24:53 EST
>To: spaf
>Subject: Xqvss numbers
>
>Okay, quicky results:
>
>	filled rectangles:  25% faster
>	lines:  no real change
>	image text 8:  slightly slower
>	text 8:  slightly slower
>	rectangles:  slightly faster
>	copy area:  20% faster
>
>It seems like it would be interesting to figure out what helps where and
>set up appropriate #ifdefs after people have had a chance to poke at it.

So, these changes also work on a Vax, but should be tuned a little
differently.  Let me encourage people who can access Vaxen to contribute
such fixes.
-- 
Gene Spafford
NSF/Purdue/U of Florida  Software Engineering Research Center,
Dept. of Computer Sciences, Purdue University, W. Lafayette IN 47907-2004
Internet:  spaf@cs.purdue.edu	uucp:	...!{decwrl,gatech,ucbvax}!purdue!spaf

spaf@cs.purdue.edu (Gene Spafford) (11/18/88)

The patches are also on gatekeeper.dec.com in the file
~ftp/pub/X11.contrib/Purdue-speedups.mfb.Z

Enjoy.

pfh@pai.UUCP (Peter Hill) (11/27/88)

In article <5462@medusa.cs.purdue.edu>, spaf@cs.purdue.EDU (Gene Spafford)
writes:
> 
> I have just sent to comp.sources.x my patches to X11 Release 3 to
> speed up the mfb (monochrome) server performance.

Thanks, Gene!  The speedups are very nice indeed.  Now for a problem report:


VERSION:
    X.V11R3 + patches 1-2 + Purdue-speedups.mfb

CLIENT MACHINE and OPERATING SYSTEM:
    Sun 3/160M, SunOS 3.2

DISPLAY:
    Sun BW2

AREA:
    mfb speedups

DESCRIPTION:
    Neither mfbsetsp.c nor mfbbitblt.c would compile under SunOS 3.2,
    apparently because of the new putbitsrop() macro in maskbits.h.
    See code below for details.

FIX:
    Perhaps there is a better/easier way to fix this.  Below is a
    context diff of my hacks to maskbits.h.  This patch can be applied
    after applying maskbits.h.patch from Purdue-speedups.mfb.


------cut here-----
*** ddx/mfb/maskbits.h.purdue	Fri Nov 25 19:52:54 1988
--- ddx/mfb/maskbits.h	Sat Nov 26 10:35:38 1988
***************
*** 353,363 ****
      else \
      { \
  	int m = 32-(x); \
  	register unsigned int *ptmp_ = (unsigned *) (pdst)+1; \
! 	*(pdst) = (*(pdst) & endtab[x]) | (t2 & starttab[x]); \
  	t1 = SCRLEFT((src), m); \
  	DoRop(t2, rop, t1, *ptmp_); \
! 	*ptmp_ = (*ptmp_ & starttab[n]) | (t2 & endtab[n]); \
      } \
  }
  
--- 353,366 ----
      else \
      { \
  	int m = 32-(x); \
+ 	register int t3; \
  	register unsigned int *ptmp_ = (unsigned *) (pdst)+1; \
! 	t3 = t2 & starttab[x]; \
! 	*(pdst) = (*(pdst) & endtab[x]) | t3; \
  	t1 = SCRLEFT((src), m); \
  	DoRop(t2, rop, t1, *ptmp_); \
! 	t3 = t2 & endtab[n]; \
! 	*ptmp_ = (*ptmp_ & starttab[n]) | t3; \
      } \
  }
  
------cut here-----
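
The shape of the change, stripped of the macro machinery, is sketched
below: each masked store is split into two statements through a
temporary, presumably to hand the compiler simpler expressions.  The
function put_bits_across and its mask variables are hypothetical; the
sketch leaves out the raster-op step and computes its masks with shifts
instead of using starttab/endtab, and it assumes pixel 0 is the most
significant bit of a 32-bit word.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: store w pixels (left-aligned in src, pixel 0 in
 * the most significant bit) at bit offset x of dst[0], spilling into
 * dst[1].  Requires 0 < x < 32, w <= 32 and x + w > 32.
 */
void put_bits_across(uint32_t dst[2], uint32_t src, unsigned x, unsigned w)
{
    unsigned m, n;
    uint32_t first_mask, spill_mask, t3;

    assert(x > 0 && x < 32 && w <= 32 && x + w > 32);

    m = 32 - x;                         /* pixels that fit in dst[0]    */
    n = x + w - 32;                     /* pixels that spill into dst[1] */
    first_mask = 0xFFFFFFFFu >> x;      /* pixels x..31 of dst[0]       */
    spill_mask = ~(0xFFFFFFFFu >> n);   /* pixels 0..n-1 of dst[1]      */

    /* First word: compute the new bits, then merge them in. */
    t3 = (src >> x) & first_mask;
    dst[0] = (dst[0] & ~first_mask) | t3;

    /* Second word: same pattern with the spilled bits. */
    t3 = (src << m) & spill_mask;
    dst[1] = (dst[1] & ~spill_mask) | t3;
}
```

The two-statement form computes the same result as the single
expression, so the split should not change what gets drawn.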

-- 
______________________________________________________________________________
Peter Hill                          pfh@pai.mn.org             +1 612 894 0313
Prime Automation, Inc.              ...{sun!tundra,umn-cs!hall,bungia}!pai!pfh