[comp.arch] variability in benchmark results

andrew@alice.att.com (Andrew Hume) (01/11/91)

	This is a regurgitation and summary of my quest for
accurate timings on a multiple-cpu box. First, let me redescribe
the situation.
	I have about 30 benchmarks, all very similar (different
string searching algorithms) with the following behaviour:
1) read approx 1.2MB of text from a file into memory,
2) approx 500 times, search through a 1MB buffer looking for a string,
3) report about 80-100 bytes of output.
Each benchmark takes between 30 and 100 cpu seconds to run, and the total memory
required by the program is approx 1.3MB.
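	For the curious, each benchmark has roughly the shape of the sketch
below; the file name, the sizes and the search() routine are illustrative
stand-ins rather than the actual code.

	#include <stdio.h>
	#include <stdlib.h>

	#define TEXTSIZE (1200*1024)	/* approx 1.2MB of text */
	#define BUFSIZE	 (1024*1024)	/* the 1MB search buffer */
	#define NPASSES	 500		/* number of search passes */

	char text[TEXTSIZE];

	/* search() stands in for whichever string-searching algorithm
	   a particular benchmark exercises */
	extern long search(char *buf, long n, char *pat);

	int main(int argc, char **argv)
	{
		FILE *fp;
		long n, i, hits = 0;

		if (argc < 2) {
			fprintf(stderr, "usage: bench pattern\n");
			exit(1);
		}
		fp = fopen("text", "r");			/* 1) read the text */
		if (fp == NULL) {
			perror("text");
			exit(1);
		}
		n = fread(text, 1, sizeof text, fp);
		fclose(fp);
		if (n > BUFSIZE)
			n = BUFSIZE;
		for (i = 0; i < NPASSES; i++)			/* 2) search repeatedly */
			hits += search(text, n, argv[1]);
		printf("%s: %ld occurrences\n", argv[1], hits);	/* 3) tiny report */
		return 0;
	}
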
	Within a ``benchmark set'', the benchmarks are run serially,
and 3 sets were run (also serially). I measured the variation in times for
each benchmark and recorded the maximum variation and the S.D. The system
is an SGI 4D/380; it has 8 33MHz R3000s sharing 256MB of physical memory.
Each R3000 board has its own caches.
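	The bookkeeping is mundane, but for concreteness: ``variation'' and
S.D. here are both expressed as a percentage of the mean time, roughly as in
the sketch below (the precise definitions are my reconstruction, so treat
them as illustrative).

	#include <math.h>

	/* t[0..n-1] holds the cpu times of one benchmark over its n runs;
	   report the max variation and the s.d., each as a percentage
	   of the mean time */
	void summary(double *t, int n, double *maxvar, double *sd)
	{
		double lo = t[0], hi = t[0], sum = 0.0, ss = 0.0, mean;
		int i;

		for (i = 0; i < n; i++) {
			if (t[i] < lo) lo = t[i];
			if (t[i] > hi) hi = t[i];
			sum += t[i];
		}
		mean = sum/n;
		for (i = 0; i < n; i++)
			ss += (t[i]-mean)*(t[i]-mean);
		*maxvar = 100.0*(hi-lo)/mean;
		*sd = 100.0*sqrt(ss/n)/mean;
	}
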
	Initially, I just ran the benchmarks on top of Unix (Irix 3.3.1)
running in single user mode. The only other processes running were
the ubiquitous sched, vhand and bdflush. Several filesystems were
mounted (but only local disk filesystems, no NFS). Under these conditions,
the maximum variation seen was 1.26% (S.D. = .65%). This surprised me;
I had expected much better accuracy for cpu-bound processes, say,
an error of a few clock ticks in 3000 or .1%. I then posted to comp.arch.
As an aside, I mentioned that doing the same runs on a Cray XMP/280
(running in regular multi-user mode) yielded a max variation of .11%.
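	The .1% expectation is just the granularity of unix cpu timing: at
the usual 100 clock ticks per second, a 30 second benchmark is about 3000
ticks, so an error of a few ticks is on the order of .1%. Reading cpu time
looks something like the sketch below (this uses the POSIX call for the tick
rate; it is a sketch, not how the benchmarks are actually timed).

	#include <stdio.h>
	#include <sys/times.h>
	#include <unistd.h>

	extern void run_benchmark(void);	/* the work being timed (illustrative) */

	int main(void)
	{
		struct tms t0, t1;
		long hz = sysconf(_SC_CLK_TCK);	/* clock ticks per second */

		times(&t0);
		run_benchmark();
		times(&t1);
		printf("user %.2fs sys %.2fs\n",
			(t1.tms_utime - t0.tms_utime)/(double)hz,
			(t1.tms_stime - t0.tms_stime)/(double)hz);
		return 0;
	}
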

	Responses ranged from cpu envy to ``you're lucky to get so
little variation'' to ``Unix can't measure time anyway''. Following the
more concrete suggestions, I tried running the benchmarks on a specific
cpu (and not cpu 0, which is more likely to incur o/s overhead),
running busy loops on the other ``unused'' cpus, and running at high
priority. The busy loops were "main(){for(;;);}" and I used
	nice --50 runon 7 benchmark.sh
This gratifyingly (if mysteriously) reduced the variability to .91%
(S.D. = .46%). I thought hard about other weird things to try but couldn't
come up with anything. Tom Szymanski suggested that the most likely cause
of the variability is boundary effects with the caches, particularly
as the program memory is about four times the cache size (16k x 16 bytes).
	So for now, I take this sort of precision as a given and
am proceeding with my research. I would like to stress that I think the
SGI box is not particularly bad in this regard. In fact, it fared very well
compared to the other systems I measured. The following table shows the average
variability (and not the maximum I used above) for a bunch of systems:
	386	sparc	sgi	vax	68k	cray
	.60%	.32%	.40%	.82%	.38%	.11%
Of these, only the sgi and cray were multiple-cpu systems. (I am mystified
as to why the 386 was so bad. The average benchmark takes about 250 cpu seconds
on this system, yet one benchmark's time varied by over 10s!)

	Thanks for the helpful responses and to all those who responded
(in no order): eugene miya, hugh lamaster, forest baskett, chris shaw,
dave fields, dan karron, chris hull, gavin bell, amir majidimehr,
mike thompson and charlie price. I was pleased at the number of SGI folks
that responded.

	Andrew Hume
	andrew@research.att.com