[comp.arch] reliable/reproducible benchmarks on SGI MIPS box

andrew@alice.att.com (Andrew Hume) (12/24/90)

	I am running some benchmarks on a variety of machines
and in particular, on an SGI 4D/380, a multiprocessor with 8
33MHz R3000 cpus. my benchmark reads in about 1.1MB of text
into an internal buffer and then runs cpu bound for about
40s. total memory usage is <2MB; the machine's memory is 256MB.
The benchmarks are run with the machine in single user (Unix)
mode with normally mounted NFS filesystems unmounted. No other
processes (except paging daemon etc) are running.

	my problem is that I see quite large variations over
multiple runs of the same benchmark, sometimes as much as
1.26%. Now, the resolution of the timer is .01s and i should see
an accuracy of about .01/40 or .025%. I am a factor of 50 off this.
does anyone know how i can run these benchmarks so as to get reproducible
timings? (i note as an aside that just running the benchmarks on the cray
in multi-user mode yields variations of the order of .15% which is
satisfactory).

	andrew hume
	andrew@research.att.com
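
The arithmetic in the post above can be checked with a short C sketch. It assumes that "variation" means the spread between the fastest and slowest run relative to the fastest; the post does not say how the 1.26% was computed, so that definition and the function names are illustrative only.

```c
#include <stddef.h>

/* Spread of a set of run times, in percent of the fastest run:
 * (max - min) / min * 100.  This definition is an assumption. */
double spread_percent(const double *t, size_t n)
{
    double lo, hi;
    size_t i;

    if (n == 0)
        return 0.0;
    lo = hi = t[0];
    for (i = 1; i < n; i++) {
        if (t[i] < lo)
            lo = t[i];
        if (t[i] > hi)
            hi = t[i];
    }
    return (hi - lo) / lo * 100.0;
}

/* Relative error expected from timer granularity alone, in percent:
 * one tick divided by the length of the run. */
double tick_error_percent(double tick_seconds, double run_seconds)
{
    return tick_seconds / run_seconds * 100.0;
}
```

With a 0.01s tick and a 40s run, tick_error_percent gives 0.025, and a pair of runs at 40.000s and 40.504s gives a spread of 1.26, roughly 50 times the granularity figure.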

tve@sprite.berkeley.edu (Thorsten von Eicken) (12/24/90)

... quick 2 cents worth of guesses:
you haven't said whether you're running your program on all 8 processors
or on only one of them. if you're running on only one, could it be that
the other seven interfere? What happens if you run a "for(;;);" program
on seven processors while running the benchmark on the eighth?
also, is there a cache-flush system call you can call before starting the
timer?
	TvE
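
A rough sketch of the experiment suggested above, using only POSIX calls. It starts a given number of "for(;;);" processes that end themselves after a fixed time. It does not pin them to particular processors; placement is left to the scheduler, and any IRIX-specific way of binding a process to a CPU is outside this sketch. The function names are hypothetical.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>

/* Fork n children that busy-loop until SIGALRM kills them after
 * `seconds`.  Returns the number of children actually started. */
int spawn_spinners(int n, unsigned seconds)
{
    int started = 0;

    while (started < n) {
        pid_t pid = fork();
        if (pid < 0)
            break;                      /* fork failed: stop early */
        if (pid == 0) {
            signal(SIGALRM, SIG_DFL);   /* default action: terminate */
            alarm(seconds);
            for (;;)
                ;                       /* the "for(;;);" program */
        }
        started++;
    }
    return started;
}

/* Wait for `started` children; return how many ended via SIGALRM. */
int reap_spinners(int started)
{
    int ended = 0, status;

    while (started-- > 0)
        if (wait(&status) > 0 && WIFSIGNALED(status)
            && WTERMSIG(status) == SIGALRM)
            ended++;
    return ended;
}
```

One would call spawn_spinners(7, 60), run the benchmark, and then call reap_spinners on the returned count, comparing the timings against runs made with the other processors idle.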

andrew@alice.att.com (Andrew Hume) (12/25/90)

In article <9932@pasteur.Berkeley.EDU>, tve@sprite.berkeley.edu (Thorsten von Eicken) writes:
~ ... quick 2 cents worth of guesses:
~ you haven't said whether you're running your program on all 8 processors
~ or on only one of them. if you're running on only one, could it be that
~ the other seven interfere? What happens if you run a "for(;;);" program
~ on seven processors while running the benchmark on the eighth?
~ also, is there a cache-flush system call you can call before starting the
~ timer?


	the program runs on just one cpu. the other processors are presumably
idle (or running some idle process). does cache-flush refer to file system?
if so, i don't see the need; my benchmark generates 200 bytes every run
(5 bytes/sec) and i'm sure one of the other 7 spare cpu's could handle
sending that one block off.

	still puzzled,
	andrew

p.s. how the hell do the specmark people do this stuff?

raytrace@cutmcvax.cs.curtin.edu.au (Phil Dench) (12/27/90)

andrew@alice.att.com (Andrew Hume) writes:


>	I am running some benchmarks on a variety of machines
>and in particular, on an SGI 4D/380, a multiprocessor with 8
>33MHz R3000 cpus. my benchmark reads in about 1.1MB of text
>into an internal buffer and then runs cpu bound for about
>40s. total memory usage is <2MB; the machine's memory is 256MB.
>The benchmarks are run with the machine in single user (Unix)
>mode with normally mounted NFS filesystems unmounted. No other
>processes (except paging daemon etc) are running.

>	my problem is that I see quite large variations over
>multiple runs of the same benchmark, sometimes as much as
>1.26%. Now, the resolution of the timer is .01s and i should see
>an accuracy of about .01/40 or .025%. I am a factor of 50 off this.
>does anyone know how i can run these benchmarks so as to get reproducible
>timings? (i note as an aside that just running the benchmarks on the cray
>in multi-user mode yields variations of the order of .15% which is
>satisfactory).

>	andrew hume
>	andrew@research.att.com

You LUCKY BASTARD! I dream of a 256MB 8-processor SGI plus access to a Cray.
There's no pleasing some people :?)
--

	Phil Dench  Andrew Marriott.

--------------------------------------------+----------------------------------
                                            | School of Computer Science,
ACSNet: raytrace@cutmcvax.cs.curtin.edu.au  | Curtin University of Technology,
UUCP:   ...!uunet!munnari!cutmcvax!raytrace | Kent Street,
ARPA:   raytrace@cutmcvax.cs.curtin.edu.au  | Bentley
                                            | Western Australia, 6102
--------------------------------------------+----------------------------------

cprice@mips.COM (Charlie Price) (12/29/90)

In article <11737@alice.att.com> andrew@alice.att.com (Andrew Hume) writes:
>
>	I am running some benchmarks on a variety of machines
>and in particular, on an SGI 4D/380, a multiprocessor with 8
>33MHz R3000 cpus.
...
>	my problem is that I see quite large variations over
>multiple runs of the same benchmark, sometimes as much as
>1.26%. Now, the resolution of the timer is .01s and i should see
>an accuracy of about .01/40 or .025%. I am a factor of 50 off this.
>does anyone know how i can run these benchmarks so as to get reproducible
>timings? (i note as an aside that just running the benchmarks on the cray
>in multi-user mode yields variations of the order of .15% which is
>satisfactory).
>
>	andrew hume
>	andrew@research.att.com

One source of variability in benchmark times that nobody else has
mentioned (so I will) is cache conflicts.
Identical executions of a benchmark use the same *virtual* locations
in the same pattern, but these virtual locations get mapped to
physical locations, and in particular cache locations, in some
manner determined by the OS, previous activity on the machine,
the phase of the moon...
If subsequent executions of the program get different patterns
of cache conflict then you can easily see several percent
difference in the execution time due to differences in cache conflict.
This isn't just speculation.
In the early days at MIPS some maddening variability in execution times
was finally traced to variability in page allocation.
The execution variability mostly went away when the OS did page coloring
(matching the physical and virtual address of a page in certain ways)
to remove the cache-use variability.

I suspect that if the OS isn't giving you reproducible use of the
caches that you won't ever be able to get reproducible benchmark times.
-- 
Charlie Price    cprice@mips.mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086-23650
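
To make the page-coloring idea concrete, here is a minimal C sketch for a direct-mapped, physically indexed cache. The page and cache sizes are hypothetical values chosen for illustration, not the parameters of any particular machine, and "same color for the virtual and physical page" is only one common way of "matching the physical and virtual address of a page".

```c
#include <stdint.h>

/* Hypothetical parameters, for illustration only. */
#define COLOR_PAGE_BYTES   4096u    /* page size                     */
#define COLOR_CACHE_BYTES  65536u   /* direct-mapped cache size      */
#define COLOR_COUNT  (COLOR_CACHE_BYTES / COLOR_PAGE_BYTES)

/* Color of the page containing addr: which page-sized slice of the
 * cache that page maps onto. */
unsigned page_color(uintptr_t addr)
{
    return (unsigned)((addr / COLOR_PAGE_BYTES) % COLOR_COUNT);
}

/* Two distinct physical pages compete for the same cache lines
 * when they have the same color. */
int pages_conflict(uintptr_t phys_a, uintptr_t phys_b)
{
    return phys_a / COLOR_PAGE_BYTES != phys_b / COLOR_PAGE_BYTES
        && page_color(phys_a) == page_color(phys_b);
}

/* A coloring allocator would accept a physical page for a virtual
 * page only if the colors agree, so the conflict pattern depends on
 * the program's virtual layout and is the same on every run. */
int color_matches(uintptr_t virt, uintptr_t phys)
{
    return page_color(virt) == page_color(phys);
}
```

Without such a policy, two hot pages that are far apart in virtual memory can land on the same color in one run and on different colors in the next, which is the run-to-run variability described above.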

yohn@tumult.asd.sgi.com (Mike Thompson) (01/03/91)

In article <44383@mips.mips.COM>, cprice@mips.COM (Charlie Price) writes:
> In article <11737@alice.att.com> andrew@alice.att.com (Andrew Hume) writes:
> >
> >	I am running some benchmarks on a variety of machines
> >and in particular, on an SGI 4D/380, a multiprocessor with 8
> >33MHz R3000 cpus.
> ...
> >	my problem is that I see quite large variations over
> >multiple runs of the same benchmark, sometimes as much as
> >1.26%....
> >
> >	andrew hume
> >	andrew@research.att.com
> 
> One source of variability in benchmark times that nobody else has
> mentioned (so I will) is cache conflicts.
> Identical executions of a benchmark use the same *virtual* locations
> in the same pattern, but these virtual locations get mapped to
> physical locations, and in particular cache locations, in some
> manner determined by the OS, previous activity on the machine,
> the phase of the moon...
> If subsequent executions of the program get different patterns
> of cache conflict then you can easily see several percent
> difference in the execution time due to differences in cache conflict.
> This isn't just speculation.
> In the early days at MIPS some maddening variability in execution times
> was finally traced to variability in page allocation.
> The execution variability mostly went away when the OS did page coloring
> (matching the physical and virtual address of a page in certain ways)
> to remove the cache-use variability.
> 
> I suspect that if the OS isn't giving you reproducible use of the
> caches that you won't ever be able to get reproducible benchmark times.
> -- 
> Charlie Price    cprice@mips.mips.com        (408) 720-1700
> MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086-23650

Good guess, but this is probably not the cause.  The SGI OS (a.k.a.
IRIX) manages page/cache coloring.  Only when memory is very tight
is a process likely to get pages that are not optimally cache-aligned.

Mike Thompson	yohn@sgi.com
Silicon Graphics Computer Systems

3ksnn64@cidmac.ecn.purdue.edu (Joe Cychosz) (01/03/91)

In article <11737@alice.att.com> andrew@alice.att.com (Andrew Hume) writes:
>
>	I am running some benchmarks on a variety of machines
>and in particular, on an SGI 4D/380, a multiprocessor with 8
>33MHz R3000 cpus.
...
>	my problem is that I see quite large variations over
>multiple runs of the same benchmark, sometimes as much as
>1.26%. Now, the resolution of the timer is .01s and i should see
>an accuracy of about .01/40 or .025%. I am a factor of 50 off this.
>does anyone know how i can run these benchmarks so as to get reproducible
>timings? (i note as an aside that just running the benchmarks on the cray
>in multi-user mode yields variations of the order of .15% which is
>satisfactory).

I am puzzled why you think the accuracy should be .01/40?  Anyway my
experience with the 60Hz and 100Hz clocks is that they are not
exceptionally accurate and are susceptible to process switching.  Someone
gets the whole interval, even though they may have used only a small
portion of that interval. The Cray uses a very accurate clock which runs
at the cycle time of the machine (~6ns for YMP).
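
The "someone gets the whole interval" effect can be illustrated with a small simulation. It assumes a simple accounting model in which, at every clock tick, whichever process is running at that instant is charged one full tick; the struct and function names are hypothetical, and a real kernel's accounting may differ.

```c
#include <stddef.h>

/* One stretch of execution: which process ran, and for how long. */
struct segment {
    int  owner;
    long usec;
};

/* Sample the schedule at every tick boundary and charge a whole tick
 * to whoever is running then.  Returns ticks charged to `who`. */
long charged_ticks(const struct segment *s, size_t n,
                   long tick_usec, int who)
{
    long now = 0, next_tick = tick_usec, ticks = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        long end = now + s[i].usec;
        /* ticks falling in (now, end] belong to this segment */
        while (next_tick <= end) {
            if (s[i].owner == who)
                ticks++;
            next_tick += tick_usec;
        }
        now = end;
    }
    return ticks;
}

/* CPU time `who` really used, in microseconds. */
long actual_usec(const struct segment *s, size_t n, int who)
{
    long total = 0;
    size_t i;

    for (i = 0; i < n; i++)
        if (s[i].owner == who)
            total += s[i].usec;
    return total;
}
```

If a benchmark (owner 0) runs for 9500us, 9000us and 9000us, with a daemon (owner 1) running for 1000us across each 10000us tick boundary, the benchmark really uses 27500us but is charged 0 ticks, while the daemon uses 3000us and is charged 3 ticks. Over a long run such errors tend to average out, but not exactly, and not identically from run to run.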

I do not know if SGI has made access to the hi-res clock on the MIPS
chip yet.

This doesn't really solve your problem here, but it may come in handy
when running benchmarks.


/* ----	second - Return user CPU time in seconds. ---------------------	*/
/*									*/
/*	Function to return the elapsed CPU time in seconds.  To		*/
/*	properly compute the elapsed CPU time, the function should be	*/
/*	called twice, once in the beginning to get the initial CPU	*/
/*	time, and a second time to get the ending time.  The elapsed	*/
/*	time is then the computed difference between the two times.	*/
/*									*/
/*									*/
/*	machine			low-res	  hi-res			*/
/*	Ardent Titan		  100Hz	  62.5ns			*/
/*	BBN Butterfly TC-2000	  100Hz					*/
/*	Convex C220		   60Hz	     1us			*/
/*	Cray 2, XMP, YMP	   ~6ns YMP				*/
/*	ETA 10			  100Hz	  P 24ns, P* 21, Q 19, E 10.5	*/
/*						  F 8.5, G 7		*/
/*	Gould 9080 and NP1	   60Hz					*/
/*	Silicon Graphics 4D	  100Hz					*/
/*	Sun 3 and 4		   60Hz					*/
/*	Vax 11/780		   60Hz					*/
/*									*/
/*	Author:								*/
/*	   J. M. Cychosz	12/15/89.				*/
/*	   Purdue University CADLAB					*/
/*									*/
/* --------------------------------------------------------------------	*/


#include <sys/types.h>
#include <sys/param.h>			/* HZ defined here		*/
#include <sys/times.h>			/* used by times()		*/
#include <sys/time.h>			/* used by getrusage()		*/
#include <sys/resource.h>

#if	ardent
extern	double	fputim ( double );
#endif

#if	butterfly			/* HZ not defined for BBN TC2000*/
#define	HZ	100.0
#endif

#if	eta10
#include <sys/syseta.h>

static int	Pico_per_Cycle;		/* ETA10 ps per machine cycle	*/

int asm	q8ic7	()	{
	tfc	%18
}
#endif

#if	gould				/* HZ not defined for Gould 9080*/
#define	HZ	60.0			/* or NP1			*/
#endif


/* ----	second - Return user CPU time in seconds. ---------------------	*/

double	second	()

{

#if	ardent			/* Ardent Titan: use internal MIPS clock*/
	return ( (double) fputim (0.0) );
#define	_CLOCK_
#endif

#if	convex			/* Convex C220: use 1us clock		*/
	struct rusage	res;

	getrusage (0,&res);
	return ( (double)
		 (res.ru_exutime.tv_sec + res.ru_exutime.tv_usec*1e-6) );
#define	_CLOCK_
#endif

#if	eta10			/* ETA 10: use real time clock		*/
	if  (!Pico_per_Cycle) 
	    (void) syseta (ETA_PSPERCYCLE, &Pico_per_Cycle);
	return ( (double) q8ic7 () * (Pico_per_Cycle * 1e-12) );
#define	_CLOCK_
#endif

#ifndef	_CLOCK_			/* Default: use low-res times() clock	*/
	struct tms	clock;

	times (&clock);

/*	HZ = 60.0	Convex, Sun, Gould, Vax				*/
/*	HZ = 100.0	SGI, Ardent, BBN Butterfly, ETA 10		*/
/*	HZ = function(cycle_time) on Cray 2, XMP, and YMP		*/

	return ( (double) clock.tms_utime / HZ );
#endif

}
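
For reference, the call-twice-and-subtract pattern described in the header comment looks roughly like this on a POSIX system. This is a separate sketch, not part of the listing above: it asks sysconf(_SC_CLK_TCK) for the tick rate instead of relying on the HZ macro, and the names user_cpu_seconds, time_user_cpu and sample_work are hypothetical.

```c
#include <sys/times.h>
#include <unistd.h>

/* User CPU time of the calling process in seconds, or -1.0 on error.
 * Resolution is one clock tick, as with the times() path above. */
double user_cpu_seconds(void)
{
    struct tms t;
    long ticks_per_sec = sysconf(_SC_CLK_TCK);

    if (ticks_per_sec <= 0 || times(&t) == (clock_t)-1)
        return -1.0;
    return (double)t.tms_utime / (double)ticks_per_sec;
}

/* Time one call of work(): sample before and after, then subtract.
 * Error returns from user_cpu_seconds() are not checked here. */
double time_user_cpu(void (*work)(void))
{
    double start = user_cpu_seconds();

    work();
    return user_cpu_seconds() - start;
}

/* Placeholder workload; volatile keeps the loop from being removed. */
void sample_work(void)
{
    volatile unsigned long sink = 0;
    unsigned long i;

    for (i = 0; i < 5000000UL; i++)
        sink += i;
}
```

time_user_cpu(sample_work) returns the user CPU seconds charged to the call, subject to the tick-granularity caveats discussed earlier in the thread.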

tdonahue@bbn.com (Tim Donahue) (01/03/91)

In article <1991Jan3.035143.18865@noose.ecn.purdue.edu>, 3ksnn64@cidmac (Joe Cychosz) writes:
>In article <11737@alice.att.com> andrew@alice.att.com (Andrew Hume) writes:
>>
>> <stuff about timer resolution deleted>
>> ...
>
>/* ----	second - Return user CPU time in seconds. ---------------------	*/
>/*									*/
>/*	Function to return the elapsed CPU time in seconds.  To properly	*/
>/*	compute the elapsed CPU time, the function should be called	*/
>/*	twice, once in the beginning to get the initial CPU time, and	*/
>/*	a second time to get the ending time.  The elapsed time is then	*/
>/*	the computed difference between the two times.			*/
>/*									*/
>/*									*/
>/*	machine			low-res	  hi-res			*/
>/*     ...
>/*	BBN Butterfly TC-2000	  100Hz		

The TC2000 does indeed include a high-resolution clock.  This 32-bit
clock has a resolution of 1 microsecond and is synchronized by hardware
among all processors in the system.  Call getusecclock() under nX or
pSOS+m to obtain the value of this clock.

The nX and pSOS operating systems also maintain a 64-bit software clock
whose low order bits are formed by the 32-bit microsecond hardware
clock.  To obtain the value of this clock, use get64bitclock().

Finally, two microsecond-resolution interrupting timers are available
under pSOS+m.  These timers compare the programmed value with the value
of the microsecond clock.  When the microsecond clock is later than or
equal to the programmed value, interrupts are delivered to the CPU.

Cheers,
Tim
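
A generic sketch of how a 64-bit software clock can be built on top of a wrapping 32-bit microsecond counter, in the spirit of the description above. It is not the nX or pSOS+m implementation: the raw counter value is passed in as a plain argument rather than read through getusecclock(), whose exact signature is not given in the post, and the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Software state: the high 32 bits, plus the last raw reading so a
 * wrap of the hardware counter can be detected. */
struct clock64 {
    uint32_t high;
    uint32_t last_low;
};

/* Fold a raw 32-bit microsecond reading into a 64-bit value.  Must be
 * called at least once per wrap period (2^32 us, about 71.6 minutes),
 * otherwise a wrap goes unnoticed. */
uint64_t clock64_update(struct clock64 *c, uint32_t low)
{
    if (low < c->last_low)
        c->high++;              /* hardware counter wrapped */
    c->last_low = low;
    return ((uint64_t)c->high << 32) | low;
}

/* Elapsed microseconds between two raw readings taken less than one
 * wrap period apart.  Unsigned subtraction handles a single wrap. */
uint32_t elapsed_usec32(uint32_t start, uint32_t end)
{
    return end - start;
}
```

For intervals as short as a 40s benchmark, elapsed_usec32 on two raw readings is enough; the 64-bit form matters only for timestamps that must stay ordered across wraps.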