[comp.windows.x] Some Xsun Benchmarks

phil@BRL.MIL (Phil Dykstra) (12/19/88)

			Some Xsun Benchmarks

I was curious to see how much the posted Xsun speedups were actually
helping, so I ran some benchmarks.  The results were interesting.  The
shock occurred when I profiled the server.  This message will detail
some of those results.

The System Used:

Sun 3/50 w/68881 SunOS 3.4 w/Berkeley TCP.  The Sun 3.4 C compiler and
GNU gcc 1.31 were used.  X11R3 with and without the Purdue and Purdue+
speedups was tested.

The Servers:

cc	- straight X11R3 server with Sun cc -O
ccP	- above but with Purdue speedups
gcc	- straight X11R3 server with gcc -O
gccP	- Purdue speedups with gcc -O
gccP+	- Purdue and Purdue+ speedups with gcc -O

	text	data	bss	dec	hex
cc	532480	40960	18040	591480	90678
ccP	524288	40960	21880	587128	8f578
gcc	442368	32768	17792	492928	78580
gccP	442368	32768	21632	496768	79480
gccP+	442368	32768	21644	496780	7948c

The Benchmarks:

Lacking anything else, I used the "blitstone" package for testing performance
(available on expo.lcs.mit.edu in contrib/stones.tar).  Those routines are
briefly described below.  I used the default parameter for each benchmark.
I also benchmarked the "maze" program by calling srandom(1) to get repeatable
random numbers, but none of the server configurations made any noticeable
improvement (*thick* line performance has not gone up in these patches).
Also timed was how long it took to "log in" to the server, but the server
itself only amounts to about 6 seconds of the roughly 30 second startup time
in my environment.  If anyone has any other benchmarks please share them.

charstones:   Writes many characters one at a time.  Characters per second.
stringstones: Writes a 96 character string many times.  Characters per sec.
linestones:   Draws many single lines ("stringart").  Lines per second.
segstones:    Draws a many-segmented line several times.  Segments per sec.
blitstones:   Rectangle bit blits.  K pixels per second.
spacestones:  "Spaceout", updates per second.

The Results: [The second table is relative to cc]

		cc	ccP	gcc	gccP	gccP+

charstones	112	112	114	114	115
stringstones	3316	3380	3968	4097	4540
linestones	97	97	100	102	110
segstones	597	598	834	864	1306
tilestones	4543	5180	6545	6751	6971
spacestones	1.33	1.22	1.45	1.45	1.50

charstones	1.00	1.00	1.02	1.02	1.03
stringstones	1.00	1.02	1.20	1.24	1.37
linestones	1.00	1.00	1.03	1.05	1.13
segstones	1.00	1.00	1.40	1.45	2.19
tilestones	1.00	1.14	1.44	1.49	1.53
spacestones	1.00	0.92	1.09	1.09	1.13
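
The second table is just each raw score divided by the cc score for the
same benchmark.  A minimal C sketch of that arithmetic follows; the
function names relative_speed and print_relative_row are made up for
illustration and are not part of the blitstone package:

```c
#include <assert.h>
#include <stdio.h>

/* Speed of one server build relative to the baseline (plain cc) build. */
double relative_speed(double score, double baseline)
{
    assert(baseline > 0.0);
    return score / baseline;
}

/* Print one row of the relative table.  scores[0] is taken to be the
   cc result, followed by ccP, gcc, gccP and gccP+ (an assumption that
   matches the column order used in the tables above). */
void print_relative_row(const char *name, const double scores[5])
{
    int i;

    printf("%-12s", name);
    for (i = 0; i < 5; i++)
        printf("\t%.2f", relative_speed(scores[i], scores[0]));
    printf("\n");
}
```

For example, the segstones row {597, 598, 834, 864, 1306} gives
1306/597, or about 2.19, for gccP+.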

Discussion:

Simply using gcc is a significant improvement.  The Purdue speedups
help quite a bit with large bit blits but seem modest otherwise.
They help even more with gcc than with cc.  Purdue+ adds another
respectable increase.  Given that the server itself is also smaller,
I would strongly recommend gcc for Sun 3/50's.

You will notice that nothing really helps the charstones.  I profiled
the server to see what was up.  The results shocked me: for the
charstones, less than 2% of the server run time was spent actually
drawing anything!!  Forgive me for airing some dirty laundry in
public, but here is the top of the flat profile (from gprof):

   %  cumulative    self              self    total          
 time   seconds   seconds    calls  ms/call  ms/call name    
 42.1      32.48    32.48    30968     1.05     1.05  _read [4]
 15.3      44.26    11.78                            mcount (599)
  9.1      51.24     6.98    10199     0.68     0.68  _writev [12]
  5.1      55.18     3.94    10446     0.38     0.38  _select [13]
  4.3      58.50     3.32    10583     0.31     0.31  _gettimeofday [15]

The large "mcount" is essentially function call overhead.  The fact
that it is so large is a consequence of nice modular and layered code.
[Note that "gcc -finline-functions" produced no additional performance
improvement, largely because most of the function calls happen between
layers which are in different source code modules.]

For every character written, the main server loop (Dispatch) does a
select to check for input, and three reads (one for the client request,
one for the keyboard, and one for the mouse).  The gettimeofday system
call is also made once per input - are you ready for this? - to know
when to blank the screen.
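
The "calls" column of the profile can be checked against this
description.  A small C sketch of the arithmetic, with calls_ratio
being a name invented here for illustration:

```c
#include <assert.h>

/* Average number of calls to one system call per call to another,
   using figures from the "calls" column of the gprof output. */
double calls_ratio(long calls, long per_calls)
{
    assert(per_calls > 0);
    return (double)calls / (double)per_calls;
}
```

With the numbers above, calls_ratio(30968, 10446) is about 2.96 reads
per select and calls_ratio(10583, 10446) is about 1.01 gettimeofday
calls per select, which fits three reads and one gettimeofday per trip
through the loop.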

How To Speed Up The Server:

The patches so far have concentrated on speeding up the graphics.  While
this is important, the whole structure of the sample server that Xsun is
built on puts another kind of limit on performance.  Client/Server
communication speed, when done via Unix domain sockets, is an important
bottleneck.  At the high end, I have heard (though only second hand)
that Ardent has used a shared memory communication scheme.  Even with
socket I/O, allowing the Sun server direct access to the keyboard, mouse,
and system clock would help a lot.

With only a tiny bit of "layering violation" you can help Xsun a fair
amount.  For example, use the results of the select made in os/4.2bsd to
decide whether to try (non-blocking) reads in the sun modules for keyboard
and mouse input.  That alone can reduce your read sys calls by 66%.
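
The idea can be sketched in C roughly as follows.  This is not the
Xsun code: read_ready_only is a hypothetical helper, and it does its
own zero-timeout select for the sake of a self-contained example,
whereas the suggestion above is to reuse the result of the select that
os/4.2bsd already makes.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Poll the descriptors once with select(), then read only from those
   reported ready.  Returns the number of read() calls issued, or -1
   on error.  (Hypothetical helper, for illustration only.) */
int read_ready_only(const int fds[], int nfds, char *buf, size_t buflen)
{
    fd_set readable;
    struct timeval zero = {0, 0};   /* poll, do not block */
    int i, maxfd = -1, nreads = 0;

    assert(nfds > 0 && buflen > 0);

    FD_ZERO(&readable);
    for (i = 0; i < nfds; i++) {
        FD_SET(fds[i], &readable);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }
    if (select(maxfd + 1, &readable, NULL, NULL, &zero) < 0)
        return -1;

    /* Skip the read entirely for descriptors with nothing pending. */
    for (i = 0; i < nfds; i++) {
        if (FD_ISSET(fds[i], &readable)) {
            if (read(fds[i], buf, buflen) < 0)
                return -1;
            nreads++;
        }
    }
    return nreads;
}
```

Given three descriptors (client, keyboard, mouse) with only the client
having data pending, this issues one read instead of three, which is
where the two-thirds reduction comes from.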

I have begun work along these lines and plan to post it at a future
date.  I will be out of town over the holidays but hope to pick this
up again in January.  I would have waited until I had some concrete
results, but the time seemed ripe for some benchmark results.

I wish to sincerely thank the authors of the Purdue and Purdue+ patches,
the Blitstones package, and of course the folks at MIT for X11R3.

- Phil
<phil@brl.mil>
uunet!brl!phil

dshr@SUN.COM (David Rosenthal) (12/20/88)

Far be it from me to discourage people from profiling the server and
speeding it up.  I have been trying to persuade people to do this
ever since R1,  and I am not trying to justify any particular piece
of code in the server - I believe that there is a lot of scope for
tuning.

However,  beware the "optimize the idle loop" syndrome.  Outputting
a large number of characters one at a time is a truly bloody stupid
thing to do with X.  Speeding up this and making no other change to
the server will not make any real application faster.  Fixing any
application that does output many characters one-at-a-time to do
something more sensible will make more difference to that application
than any amount of server tuning.

If you are going to profile the server and find things to speed up
please look at it doing something realistic,  not "charstones".  I
have seen more effort wasted optimizing to unrealistic benchmarks
than I can remember.

In particular,  if you are reading the keyboard and mouse all the time
there is something wrong.  The Sun server uses SIGIO and should never
read them unless there is something there to read.  Use trace(1) to
find out whether this is so.

	David.