andrew@alice.att.com (Andrew Hume) (01/11/91)
This is a regurgitation and summary of my quest for accurate timings on a multiple-cpu box.

First, let me redescribe the situation. I have about 30 benchmarks, all very similar (different string searching algorithms), with the following behaviour:

	1) read approx 1.2MB of text from a file into memory,
	2) approx 500 times, search through a 1MB buffer looking for a string,
	3) report about 80-100 bytes of output.

Each benchmark takes between 30-100 cpu seconds to run, and the total memory required by the program is approx 1.3MB. Within a ``benchmark set'', the benchmarks are run serially, and 3 sets were run (serially). I measured the variation of times for each benchmark and recorded the maximum variation and the S.D.

The system is an SGI 4D/380; it has 8 33MHz R3000s sharing 256MB of physical memory. Each R3000 board has its own caches. Initially, I just ran the benchmarks on top of Unix (Irix 3.3.1) running in single-user mode. The only other processes running were the ubiquitous sched, vhand and bdflush. Several filesystems were mounted (but only local disk filesystems, no NFS). Under these conditions, the maximum variation seen was 1.26% (S.D. = .65%). This surprised me; I had expected much better accuracy for cpu-bound processes, say an error of a few clock ticks in 3000, or .1%.

I then posted to comp.arch. As an aside, I mentioned that doing the same runs on a Cray XMP/280 (running in regular multi-user mode) yielded a max variation of .11%. Responses ranged from cpu envy to ``you're lucky to get such little variation'' to ``Unix can't measure time anyway''. Following the more concrete suggestions, I tried running the benchmarks on a specific cpu (and not cpu 0, which is more likely to incur o/s overhead), running busy loops on the other ``unused'' cpus, and running at high priority. The busy loops were ``main(){for(;;);}'' and I used

	nice --50 runon 7 benchmark.sh

This gratifyingly (however mysteriously) reduced the variability to .91% (S.D. = .46%). I thought hard about other weird things to try but couldn't come up with anything. Tom Szymanski suggested the most likely cause of the variability is boundary effects with the caches, particularly as the program memory is about four times the cache size (16k x 16 bytes). So for now, I take this sort of precision as a given and am proceeding with my research.

I would like to stress that I think the SGI box is not particularly bad in this regard. In fact, it fared very well compared to the other systems I measured. The following table shows the average variability (and not the maximum I used above) for a bunch of systems:

	386	sparc	sgi	vax	68k	cray
	.60%	.32%	.40%	.82%	.38%	.11%

Of these, only the sgi and cray were multiple-cpu systems. (I am mystified as to why the 386 was so bad. The average benchmark takes about 250 cpu secs on this system, yet one benchmark's time varied by over 10s!)

Thanks for the helpful responses, and to all those who responded (in no order): eugene miya, hugh lamaster, forest baskett, chris shaw, dave fields, dan karron, chris hull, gavin bell, amir majidimehr, mike thompson and charlie price. I was pleased at the number of SGI folks that responded.

Andrew Hume
andrew@research.att.com
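
p.s. For anyone who wants to try the same sort of measurement, here is a minimal sketch of the timing and statistics part. This is not my actual harness: one_run() below is a dummy stand-in for a real string-searching benchmark, the buffer is filled with dummy text rather than the 1.2MB read from a file, getrusage() is assumed for the cpu times, and ``max variation'' is taken to be (max-min)/mean.

/*
 * Minimal sketch of the measurement, not the real harness.
 * one_run() is a dummy stand-in for one of the string-searching
 * benchmarks; the real programs read approx 1.2MB of text from a file.
 */
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <sys/time.h>
#include <sys/resource.h>

#define BUFSZ	(1024*1024)	/* 1MB search buffer, as in the benchmarks */
#define NSEARCH	500		/* approx 500 searches per run */
#define NRUNS	3		/* 3 serial sets, as described above */

static char buf[BUFSZ+1];
static int nfound;

static void
one_run(void)			/* dummy benchmark: NSEARCH scans of buf */
{
	int i;

	for(i = 0; i < NSEARCH; i++)
		if(strstr(buf, "needle"))
			nfound++;
}

static double
cpusecs(void)			/* user+system cpu time so far, in seconds */
{
	struct rusage ru;

	getrusage(RUSAGE_SELF, &ru);
	return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec/1e6
	     + ru.ru_stime.tv_sec + ru.ru_stime.tv_usec/1e6;
}

int
main(void)
{
	double t[NRUNS], t0, sum = 0, sumsq = 0, min, max, mean, var;
	int i;

	memset(buf, 'a', BUFSZ);	/* dummy text; the pattern is absent */
	buf[BUFSZ] = '\0';

	for(i = 0; i < NRUNS; i++){
		t0 = cpusecs();
		one_run();
		t[i] = cpusecs() - t0;
		sum += t[i];
		sumsq += t[i]*t[i];
	}
	mean = sum/NRUNS;
	var = sumsq/NRUNS - mean*mean;
	if(var < 0)
		var = 0;		/* guard against rounding */
	min = max = t[0];
	for(i = 1; i < NRUNS; i++){
		if(t[i] < min) min = t[i];
		if(t[i] > max) max = t[i];
	}
	printf("%d matches; mean %.2fs  max variation %.2f%%  S.D. %.2f%%\n",
		nfound, mean, 100*(max-min)/mean, 100*sqrt(var)/mean);
	return 0;
}

Compile with something like ``cc -O time.c -lm''; to repeat the busy-loop experiment, pin it to a cpu with runon as above and keep the other cpus spinning.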