[comp.windows.x] Release 2.0 of Purdue/Purdue+ Speedups

spaf@PURDUE.EDU (Gene Spafford) (01/23/89)

Version 2.0 of the Purdue and PurduePlus X11 server speedups is now
available for ftp.  This release integrates both sets of patches,
along with some new changes and bug fixes.  Included is special code
to work around the Sun 3/60+CG4 firmware bug causing bitrot when using
the bfins instruction.

You can ftp a copy from expo.lcs.mit.edu; get the file
"contrib/Purdue.2.0-tar.Z".  You can also ftp a copy from
mordred.cs.purdue.edu; get the file "pub/X11/Purdue.2.0-tar.Z".  The
patches will be submitted to the comp.sources.x group for publication.

Enclosed are the "Timings" and "README" files from this distribution.

-----------------------------Timings----------------------------------
The following are some rough timing figures for the performance
improvement possible using the Purdue/Purdue+ patches.

System:
	Sun 3/60 with CG4, Sun OS 3.4, 8Mb memory, local disk
	All tests run with "-mono" switch on server.
	All compiles done with software floating point.

Original version:
	X11R3 server, patches 1-4 applied.
	Compiled with Sun cc compiler, -O -fsoft-float options

New version:
	X11R3 server, patches 1-4 applied.
	Purdue/Purdue+ 2.0 patches applied.
	Compiled with 1.31 GCC, with options:
	    -O -traditional -msoft-float -fstrength-reduce -finline-functions
	Linked with Sun 3.4 "malloc" recompiled with GCC
	    (same options as above)

Run against canned exercise, including:
	xload, xphoon, xsetroot -bitmap, xterm (with and without -j),
	ico -r, and others.

Original version user+sys time: 289.0 seconds, startup to shutdown.
New version user+sys time: 218.4 seconds, startup to shutdown, a 24+%
    improvement.

Some selected excerpts from gprof for the two versions:

Old version                        New version
-----------                        -----------
   total                              total
 ms/call name                       ms/call name
  42.50  _mfbUnnaturalTileFS         31.34  _mfbUnnaturalTileFS
  10.26  _mfbDoBitblt                 7.07  _mfbDoBitblt
   6.62  _mfbImageGlyphBltWhite       4.19  _mfbImageGlyphBltWhite
   4.98  _mfbTEGlyphBltWhite          3.70  _mfbTEGlyphBltWhite
   4.50  _mfbPaintWindow32            3.88  _mfbPaintWindow32
   1.71  _mfbPushPixels               1.43  _mfbPushPixels
   1.15  _mfbBlackSolidFS             0.78  _mfbBlackSolidFS
   0.71  _mfbLineSS                   0.52  _mfbLineSS
   0.37  _mfbSolidBlackArea           0.32  _mfbSolidBlackArea
   0.21  _mfbBresS                    0.12  _mfbBresS
   0.12  _malloc                      0.05  _malloc
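
The figures above can be read two ways: as the share of the old time
that was saved, or as a speedup ratio.  The following is only a small
sketch of that arithmetic; the function names are made up for this
illustration and are not part of the distribution.

```c
#include <assert.h>

/* Share of the old time that was saved: (old - new) / old * 100. */
static double percent_saved(double old_time, double new_time)
{
    assert(old_time > 0.0);
    return (old_time - new_time) / old_time * 100.0;
}

/* Speedup as extra work done per unit time: (old / new - 1) * 100. */
static double percent_speedup(double old_time, double new_time)
{
    assert(new_time > 0.0);
    return (old_time / new_time - 1.0) * 100.0;
}
```

With the totals above, percent_saved(289.0, 218.4) comes to roughly
24.4.  For the _malloc row, percent_speedup(0.12, 0.05) comes to
roughly 140, although the same row is only about a 58% saving in time.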

-----------------------------README----------------------------------
About Purdue/PurduePlus 2.0
---------------------------
This is the second release (for X11R3) of a set of changes to the
frame buffer code of the X11 sample server.  These changes are
designed to make the server faster for B/W for most machines, and Vax
and 68020 machines (e.g., MacIIs, Apollos, Sun 3s) in particular.
(Patches for the color fb will follow, eventually.)

The changes make a significant (but sometimes difficult to measure
objectively) impact on the speed of most operations.  This speedup
will differ based on your job mix and machine configuration.  Some
operations appear to take up to 50% less cpu time to complete.
Incremental measurements with gprof, time, and other tools show each
change to have a positive overall effect on the server efficiency.  In
particular, painting windows and drawing lines appear to be much
faster.  An "ico -r" is obviously faster and smoother, as is tiling
the root window.

Interestingly enough, the binary after installing these patches also
seems *smaller*.

This second release is basically an integration of the first releases
of the Purdue and Purdue+ patches, along with new
optimizations and bug fixes.  Some special changes have been made to
take advantage of optimizations possible when using the GCC compiler.
These have all been ifdef'd on the symbol __GNUC__ so they will not
interfere with compilation using other compilers.  However, if you
have the GCC compiler, you can take advantage of these (and they are
well worth the effort!).  The GCC-specific changes are noted below.

Motivation & Changes
--------------------
The generic server shipped with X11R3 is designed to run on many
different machines.  It was not written with speed in mind, although
some efforts were made at optimization.  Looking at the code reveals
a number of places where changes could be made to make the code
faster.  These include:
    * Optimizing or adding bitmasking functions, taking advantage of
      properties known to exist for certain arithmetic operators
      and domains of input;
    * Replacing calculated bitmasks with table lookups;
    * Using Duff's device in some places where it looks beneficial
      (note: the first release of these patches used a Duff's device
       of order 8.  Tests with Sun 3s and MacIIs show that an order
       4 device gives better performance, probably due to caching);
    * Reordering code to share variables or move invariants out of
      loops;
    * Expanding some code inline instead of doing calls or loops;
    * Taking advantage of knowledge about *when* code is called.
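
For readers who have not met Duff's device, here is a minimal sketch
of an order 4 device.  It is not code from the patches; the function
name fill_words_duff4 and its arguments are invented for this example.
It stores one value into a run of words, using a switch to jump into
the middle of a loop that is unrolled four times.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: store `value` into `count` consecutive words. */
static void fill_words_duff4(unsigned int *dst, unsigned int value,
                             size_t count)
{
    size_t passes;

    assert(dst != NULL || count == 0);
    if (count == 0)
        return;                    /* the switch below assumes count > 0 */

    passes = (count + 3) / 4;      /* trips through the unrolled loop */
    switch (count & 3) {           /* enter mid-loop to handle the remainder */
    case 0: do { *dst++ = value;
    case 3:      *dst++ = value;
    case 2:      *dst++ = value;
    case 1:      *dst++ = value;
            } while (--passes > 0);
    }
}
```

An order 8 device has the same shape with eight stores and a test on
(count & 7).  Which order wins depends on the machine, as noted above.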

We have tried to make these changes in a way that is maintainable
and easily marked; every modification is enclosed in ifdef's on
the symbol PURDUE.
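
As a rough illustration of this style, the sketch below guards a
table-lookup bitmask with the PURDUE symbol and falls back to the
calculated mask otherwise.  The names start_mask and start_mask_table,
the 32-bit word size, and the bit order are assumptions made for the
example; they are not the contents of maskbits.h.

```c
#include <assert.h>
#include <stdint.h>

#ifndef PURDUE
#define PURDUE          /* normally supplied as -DPURDUE when compiling */
#endif

#ifdef PURDUE
/* Entry n has the n leftmost bits clear and the remaining bits set. */
static const uint32_t start_mask_table[32] = {
    0xffffffffu, 0x7fffffffu, 0x3fffffffu, 0x1fffffffu,
    0x0fffffffu, 0x07ffffffu, 0x03ffffffu, 0x01ffffffu,
    0x00ffffffu, 0x007fffffu, 0x003fffffu, 0x001fffffu,
    0x000fffffu, 0x0007ffffu, 0x0003ffffu, 0x0001ffffu,
    0x0000ffffu, 0x00007fffu, 0x00003fffu, 0x00001fffu,
    0x00000fffu, 0x000007ffu, 0x000003ffu, 0x000001ffu,
    0x000000ffu, 0x0000007fu, 0x0000003fu, 0x0000001fu,
    0x0000000fu, 0x00000007u, 0x00000003u, 0x00000001u
};
#endif

/* Mask selecting every bit from position n (counted from the left) on. */
static uint32_t start_mask(unsigned int n)
{
    assert(n < 32);
#ifdef PURDUE
    return start_mask_table[n];     /* table lookup */
#else
    return 0xffffffffu >> n;        /* original calculated form */
#endif
}
```

Because the unmodified code stays in the #else branch, building
without -DPURDUE gives back the original behavior.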

Installation
------------
The patches in this archive should all be applied to the files in the
server/ddx/{mfb,cfb,mi} and server/include directories.  These are all
formed to apply to *unmodified* X11R3 server sources.  Using Larry Wall's
"patch" program, you can apply them all as follows:
	server="path to your X11 server source directory"
	for patch in *.patch
	do
	    patch -l -N -p -d $server < $patch
	done

Next, you need to set the symbol PURDUE (and possibly NO_3_60_CG4, see
"A GCC & Sun Problem," below) in your site.def file (e.g., "#define
OptimizedCDebugFlags -O -DPURDUE") to use them.  You can also patch
your Makefiles (e.g., server/ddx/mfb/Imakefile) as follows:

    *** server/ddx/mfb/Imakefile.orig	Thu Nov 17 15:52:45 1988
    --- server/ddx/mfb/Imakefile		Thu Nov 17 15:52:45 1988
    ***************
    *** 19,24 ****
    --- 19,25 ----
	     mfbpawhite.o mfbpablack.o mfbpainv.o mfbtile.o \
	       mfbtewhite.o mfbteblack.o mfbmisc.o mfbbstore.o

    + DEFINES = -DPURDUE
      STD_DEFINES = ServerDefines
      CDEBUGFLAGS = ServerCDebugFlags
      INCLUDES = -I. -I../../include -I$(INCLUDESRC)

Similar patches must be made to ddx/mi/Imakefile and ddx/cfb/Imakefile
since ddx/mfb/maskbits.h is included in files in those directories.

Note: The change to ddx/mi/miarc.c is to fix a bug, and you may
install it if you wish.  The bug has been submitted to the X folks but
not yet officially "blessed."  The change to ddx/mfb/mfbimggblt.c is
also a bug fix for a clipping problem, and it too has yet to be
officially sanctioned, but it works for us so you're welcome to use it
if you see fit.

Whatever changes you make, you will need to cd to the server
directory, then:
	make clean Makefile; make Makefiles depend; make

GCC Notes
---------
A working server can be built using gcc 1.31 and the flags
"-O -traditional -finline-functions -fstrength-reduce" and either of
"-msoft-float" or "-m68881", as appropriate.  gcc 1.32 is known
to produce bad code and should be avoided.  

GCC-specific changes in these patches include:
    * Special "asm" instructions to use bitfield instructions on
      Vaxen and 68020 machines in place of shift/mask combinations
      in the getbits/putbits macros.
    * Using the builtin alloca function instead of the library
      alloca call.

GCC-specific changes have been marked with ifdef __GNUC__.  Note that
the change to include/os.h does *not* have the PURDUE symbol
associated with it since it is dependent on only the compiler being
used.
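
A compiler-only guard of this kind might look like the sketch below.
This is not the text of the os.h change.  The macro name STACK_ALLOC
and the caller reverse_in_place are hypothetical, and the non-GCC
branch assumes the system supplies alloca() through <alloca.h>.

```c
#include <assert.h>
#include <string.h>

#ifdef __GNUC__
/* GCC expands the builtin inline; no library call is made. */
#define STACK_ALLOC(nbytes) __builtin_alloca(nbytes)
#else
/* Assumption: other compilers get alloca() from <alloca.h>. */
#include <alloca.h>
#define STACK_ALLOC(nbytes) alloca(nbytes)
#endif

/* Hypothetical caller: reverse a short string using stack scratch space. */
static void reverse_in_place(char *s)
{
    size_t len = strlen(s);
    size_t i;
    char *tmp;

    assert(len < 1024);             /* keep stack allocations small */
    tmp = (char *) STACK_ALLOC(len + 1);
    for (i = 0; i < len; i++)
        tmp[i] = s[len - 1 - i];
    tmp[len] = '\0';
    strcpy(s, tmp);
}   /* the scratch space goes away when the function returns */
```

Note that the guard tests only __GNUC__ and not PURDUE, so the choice
follows the compiler regardless of which patches are enabled.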

If you have source code available, consider compiling your heap code
(malloc, free, calloc, etc) with gcc and including it with the server.
You can either do this with the entire library, or you can copy the
heap source code into the os/4.2bsd directory and recompile it there.
Under SunOS 3.4 on a Sun 3, recompiling the heap code with gcc (using
the -finline-functions and -fstrength-reduce options) results in more
than a 100% speedup in heap operations.

Also, using GCC to compile os/4.2bsd/oscolor.c can result in problems
unless you take corrective measures.  The problem lies with the fact
that GCC returns structures differently as function values than does
cc-derived code.  The symptom of this problem is that a GCC-compiled
server will have totally black (or white) screens with no observable
text.  To fix the problem, either compile oscolor.c with the regular
"cc" compiler, or compile the dbm library with gcc and link against
that.  You can also apply the enclosed dbm-gcc.h.patch file to your
/usr/include/dbm.h file and compile oscolor.c with gcc as normal.

If you have source code available, copy both the dbm and heap files
to os/4.2bsd, then modify the Imakefile as follows:

    *** Imakefile.orig	Sun Oct 30 22:46:56 1988
    --- Imakefile	Wed Jan 18 22:49:37 1989
    ***************
    *** 13,23 ****
       */

      #ifndef OtherSources
    ! #define OtherSources
      #endif

      #ifndef OtherObjects
    ! #define OtherObjects
      #endif

      BOOTSTRAPCFLAGS = 
    --- 13,23 ----
       */

      #ifndef OtherSources
    ! #define OtherSources dbm.c malloc.c
      #endif

      #ifndef OtherObjects
    ! #define OtherObjects dbm.o malloc.o
      #endif

      BOOTSTRAPCFLAGS = 


A GCC & Sun Problem
-------------------
One change in particular should be noted.  The inclusion of the
GCC-defined "asm" statements to speed up bit operations is a big win
for most 68020 machines and Vaxen.  Unfortunately, the Sun 3/60+CG4
combination as sold by Sun has a bug in the firmware that causes
writes using the "bfins" instruction to fail in some circumstances.
This has been reported to Sun as bug report #1016963.  You should
enquire of your Sun support people if some kind of fix is available.

In the meantime, this set of patches is distributed with a software
workaround.  BY DEFAULT the workaround is installed.  If you are
building a server that will NEVER run on Sun 3/60+CG4 machines, then
be sure to define the symbol NO_3_60_CG4 along with the symbol PURDUE.
I.e., your site.def file should include the flags -DPURDUE
-DNO_3_60_CG4.  You only need to do this if you build the server with
the gcc compiler and you are building for a mc68020 processor; the
optimizations involved are automatically enabled for other
architectures.  Once Sun develops an ECO to fix the bug, this flag can
also be turned on for 3/60+CG4 machines.  This would be nice, because
without the fix, those machines can use only half of the speedups.
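
The rules above can be summarized in a sketch like the following.  It
is an illustration only, not the logic in the patches: USE_BFINS and
put_bits_portable are invented names, testing the predefined symbol
mc68020 is an assumption about how the processor is detected, and the
asm variant itself is not shown.  The portable routine does the same
job as a bit field insert with a shift and mask, counting bit offsets
from the left of a 32-bit word.

```c
#include <assert.h>
#include <stdint.h>

/*
 * USE_BFINS (hypothetical) is 1 only when building with gcc for a
 * 68020 with PURDUE set, and the builder has promised via NO_3_60_CG4
 * that the server will never run on a Sun 3/60+CG4.
 */
#if defined(PURDUE) && defined(__GNUC__) && defined(mc68020) \
    && defined(NO_3_60_CG4)
#define USE_BFINS 1
#else
#define USE_BFINS 0
#endif

/*
 * Shift/mask insert: place the low `width` bits of `value` into `word`,
 * starting `off` bits from the left.  The field must fit in one word.
 */
static uint32_t put_bits_portable(uint32_t word, unsigned int off,
                                  unsigned int width, uint32_t value)
{
    uint32_t mask;
    unsigned int shift;

    assert(width >= 1 && width <= 32 && off + width <= 32);
    mask = (width == 32) ? 0xffffffffu : ((1u << width) - 1u);
    shift = 32 - off - width;
    return (word & ~(mask << shift)) | ((value & mask) << shift);
}
```

For example, put_bits_portable(0x12345678, 8, 8, 0xAB) yields
0x12AB5678.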

Color Machines
--------------
As time allows, we will examine similar changes to the cfb code for
color machines.  Stay tuned.

Questions
---------
We will try to respond to any of your questions or comments about
these patches -- just send us some e-mail.  We would also like to hear
about any of your own enhancements, benchmarks, etc.  Enjoy!

Gene Spafford		&	Martin Friedmann
spaf@cs.purdue.edu		martin@citi.umich.edu
1/22/89

Thanks to:
----------
Sam Kimery of PURDUE ECN helped develop the optimizations in
the first release of these fixes (for X11R2).  Terry Donahue of
Project Athena contributed some server fixes with the X11R3 release
that helped focus our attention on certain sections of code.  The
Purdue/Florida Software Engineering Research Center provided the
machines and funding that allowed Spaf to do his tinkering.  Thanks
to Jim Fulton for testing the changes (Release I) on a Mac II, and for
recommending some format changes.  Rusty Sanders of Megatek helped
isolate the bug in the Sun 3/60+CG4 combo.

Our thanks and apologies to anyone else we forgot to mention.