phil@BRL.MIL (Phil Dykstra) (12/19/88)
Some Xsun Benchmarks

I was curious to see how much the posted Xsun speedups were actually
helping, so I ran some benchmarks. The results were interesting. The
shock occurred when I profiled the server. This message will detail
some of those results.

The System Used:

Sun 3/50 w/68881
SunOS 3.4 w/Berkeley TCP.

The Sun 3.4 C compiler and GNU gcc 1.31 were used. X11R3 with and
without the Purdue and Purdue+ speedups was tested.

The Servers:

cc    - straight X11R3 server with Sun cc -O
ccP   - above but with Purdue speedups
gcc   - straight X11R3 server with gcc -O
gccP  - Purdue speedups with gcc -O
gccP+ - Purdue and Purdue+ speedups with gcc -O

        text    data    bss     dec     hex
cc      532480  40960   18040   591480  90678
ccP     524288  40960   21880   587128  8f578
gcc     442368  32768   17792   492928  78580
gccP    442368  32768   21632   496768  79480
gccP+   442368  32768   21644   496780  7948c

The Benchmarks:

Lacking anything else, I used the "blitstone" package for testing
performance (available on expo.lcs.mit.edu in contrib/stones.tar).
Those routines are briefly described below. I used the default
parameter for each benchmark.

I also benchmarked the "maze" program by calling srandom(1) to get
repeatable random numbers, but none of the server configurations made
any noticeable improvement (*thick* line performance has not gone up
in these patches). Also timed was how long it took to "log in" to the
server, but the server itself only amounts to about 6 seconds of the
roughly 30 second startup time in my environment. If anyone has any
other benchmarks please share them.

charstones:   Writes many characters one at a time. Characters per second.
stringstones: Writes a 96 character string many times. Characters per second.
linestones:   Draws many single lines ("stringart"). Lines per second.
segstones:    Draws a many-segmented line several times. Segments per second.
blitstones:   Rectangle bit blits. K pixels per second.
spacestones:  "Spaceout", updates per second.

The Results: [The second table is relative to cc]

                cc      ccP     gcc     gccP    gccP+
charstones      112     112     114     114     115
stringstones    3316    3380    3968    4097    4540
linestones      97      97      100     102     110
segstones       597     598     834     864     1306
tilestones      4543    5180    6545    6751    6971
spacestones     1.33    1.22    1.45    1.45    1.50

charstones      1.00    1.00    1.02    1.02    1.03
stringstones    1.00    1.02    1.20    1.24    1.37
linestones      1.00    1.00    1.03    1.05    1.13
segstones       1.00    1.00    1.40    1.45    2.19
tilestones      1.00    1.14    1.44    1.49    1.53
spacestones     1.00    0.92    1.09    1.09    1.13

Discussion:

Simply using gcc is a significant improvement. The Purdue speedups
help quite a bit with large bit blits but seem modest otherwise. They
help even more with gcc than with cc. Purdue+ adds another respectable
increase. Given that the server itself is also smaller, I would
strongly recommend gcc for Sun 3/50's.

You will notice that nothing really helps the charstones. I profiled
the server to see what was up. The results shocked me: for the
charstones, less than 2% of the server run time was spent actually
drawing anything!! Forgive me for airing some dirty laundry in public,
but here is the top of the flat profile (from gprof):

  %   cumulative    self               self     total
 time    seconds  seconds    calls  ms/call  ms/call  name
 42.1      32.48    32.48    30968     1.05     1.05  _read [4]
 15.3      44.26    11.78                             mcount (599)
  9.1      51.24     6.98    10199     0.68     0.68  _writev [12]
  5.1      55.18     3.94    10446     0.38     0.38  _select [13]
  4.3      58.50     3.32    10583     0.31     0.31  _gettimeofday [15]

The large "mcount" is essentially function call overhead. The fact
that it is so large is a consequence of nice modular and layered code.
[Note that "gcc -finline-functions" produced no additional performance
improvement, largely because most of the function calls happen between
layers which are in different source code modules.]

For every character written, the main server loop (Dispatch) does a
select to check for input, and three reads (one for the client
request, one for the keyboard, and one for the mouse).
The gettimeofday system call is also made once per input - are you
ready for this? - to know when to blank the screen.

How To Speed Up The Server:

The patches so far have concentrated on speeding up the graphics.
While this is important, the whole structure of the sample server that
Xsun is built on puts another kind of limit on performance.
Client/Server communication speed, when done via Unix domain sockets,
is an important bottleneck. At the high end, I have heard (though only
second hand) that Ardent has used a shared memory communication
scheme. Even with socket I/O, allowing the Sun server direct access to
the keyboard, mouse, and system clock would help a lot.

With only a tiny bit of "layering violation" you can help Xsun a fair
amount. For example, use the results of the select made in os/4.2bsd
to decide whether to try (non-blocking) reads in the sun modules for
keyboard and mouse input. That alone can reduce your read sys calls by
66%, since two of the three reads per request go away whenever the
keyboard and mouse are idle.

I have begun work along these lines and plan to post it at a future
date. I will be out of town over the holidays but hope to pick this up
again in January. I would have waited until I had some concrete
results, but the time seemed ripe for some benchmark results.

I wish to sincerely thank the authors of the Purdue and Purdue+
patches, the Blitstones package, and of course the folks at MIT for
X11R3.

- Phil
<phil@brl.mil>
uunet!brl!phil
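
The select-gating idea described in the post can be sketched roughly as
follows. This is only an illustration under stated assumptions, not
code from the Xsun server: the names input_fds and poll_inputs are
hypothetical, the three descriptors stand in for the client socket,
keyboard, and mouse, and the zero timeout is chosen for the example
(the real Dispatch loop is organized differently).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical descriptor set: one client socket plus two input devices. */
struct input_fds {
    int client;
    int keyboard;
    int mouse;
};

/*
 * Ask select() once which descriptors are readable, then read() only
 * those. Returns the number of successful reads, or -1 if select()
 * fails. A real server would dispatch each descriptor to its own
 * handler; here every read lands in the same buffer for brevity.
 */
int poll_inputs(const struct input_fds *in, char *buf, size_t buflen)
{
    int fds[3];
    fd_set readable;
    struct timeval no_wait;
    int i, maxfd = -1, reads = 0;

    assert(in != NULL && buf != NULL && buflen > 0);
    fds[0] = in->client;
    fds[1] = in->keyboard;
    fds[2] = in->mouse;

    FD_ZERO(&readable);
    for (i = 0; i < 3; i++) {
        FD_SET(fds[i], &readable);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }

    /* Assumption for the sketch: poll without blocking. */
    no_wait.tv_sec = 0;
    no_wait.tv_usec = 0;
    if (select(maxfd + 1, &readable, NULL, NULL, &no_wait) < 0)
        return -1;

    /* Skip the read() entirely for descriptors select() did not flag. */
    for (i = 0; i < 3; i++) {
        if (FD_ISSET(fds[i], &readable) && read(fds[i], buf, buflen) >= 0)
            reads++;
    }
    return reads;
}
```

When only the client has data pending, this issues one read instead of
three, which is where a saving of about two thirds of the read calls
would come from.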
dshr@SUN.COM (David Rosenthal) (12/20/88)
Far be it from me to discourage people from profiling the server and
speeding it up. I have been trying to persuade people to do this ever
since R1, and I am not trying to justify any particular piece of code
in the server - I believe that there is a lot of scope for tuning.

However, beware the "optimize the idle loop" syndrome. Outputting a
large number of characters one at a time is a truly bloody stupid
thing to do with X. Speeding up this and making no other change to the
server will not make any real application faster. Fixing any
application that does output many characters one-at-a-time to do
something more sensible will make more difference to that application
than any amount of server tuning.

If you are going to profile the server and find things to speed up
please look at it doing something realistic, not "charstones". I have
seen more effort wasted optimizing to unrealistic benchmarks than I
can remember.

In particular, if you are reading the keyboard and mouse all the time
there is something wrong. The Sun server uses SIGIO and should never
read them unless there is something there to read. Use trace(1) to
find out whether this is so.

David.