andrew@alice.att.com (Andrew Hume) (12/24/90)
I am running some benchmarks on a variety of machines and in particular, on an SGI 4D/380, a multiprocessor with 8 33MHz R3000 cpus. my benchmark reads in about 1.1MB of text into an internal buffer and then runs cpu bound for about 40s. total memory usage is <2MB; the machine's memory is 256MB. The benchmarks are run with the machine in single user (Unix) mode with normally mounted NFS filesystems unmounted. No other processes (except paging daemon etc) are running.

my problem is that I see quite large variations over multiple runs of the same benchmark, sometimes as much as 1.26%. Now, the resolution of the timer is .01s and i should see an accuracy of about .01/40 or .025%. I am a factor of 50 off this. does anyone know how i can run these benchmarks so as to get reproducible timings? (i note as an aside that just running the benchmarks on the cray in multi-user mode yields variations of the order of .15% which is satisfactory).

andrew hume
andrew@research.att.com
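[Editor's sketch] The arithmetic in the post can be checked with a few lines of C. The function names are hypothetical, and the post does not say how "variation" was computed, so the spread below (max minus min, over the mean) is only one assumed definition.

```c
#include <assert.h>
#include <stddef.h>

/* Best-case relative error, in percent, from timer resolution alone. */
double timer_floor_percent(double resolution_s, double runtime_s)
{
    assert(runtime_s > 0.0);
    return 100.0 * resolution_s / runtime_s;
}

/* Spread of n timings, in percent: (max - min) / mean.
   ASSUMPTION: the post does not define "variation"; this is one choice. */
double spread_percent(const double *t, size_t n)
{
    assert(t != NULL && n > 0);
    double min = t[0], max = t[0], sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (t[i] < min) min = t[i];
        if (t[i] > max) max = t[i];
        sum += t[i];
    }
    return 100.0 * (max - min) / (sum / (double)n);
}
```

With the posted figures, timer_floor_percent(0.01, 40.0) gives 0.025, and 1.26 / 0.025 is about 50, which is the "factor of 50" quoted above.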
tve@sprite.berkeley.edu (Thorsten von Eicken) (12/24/90)
... quick 2 cents worth of guesses: you haven't said whether you're running your program on all 8 processors or on only one of them. if you're running on only one, could it be that the other seven interfere? What happens if you run a "for(;;);" program on seven processors while running the benchmark on the eighth? also, is there a cache-flush system call you can call before starting the timer? TvE
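[Editor's sketch] The "for(;;);" experiment suggested above could be set up roughly as follows. This is a minimal sketch assuming a POSIX system; start_spinners and stop_spinners are hypothetical helper names, and nothing here pins a process to a particular processor, so placement is left to the scheduler.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Start up to n busy-loop children; returns how many were started. */
int start_spinners(pid_t *pids, int n)
{
    int started = 0;

    assert(pids != NULL && n >= 0);
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;              /* fork failed: stop early */
        if (pid == 0) {
            for (;;)
                ;               /* child: spin until killed */
        }
        pids[started++] = pid;
    }
    return started;
}

/* Kill and reap the children started by start_spinners(). */
void stop_spinners(const pid_t *pids, int n)
{
    for (int i = 0; i < n; i++) {
        kill(pids[i], SIGKILL);
        waitpid(pids[i], NULL, 0);
    }
}
```

The idea would be to call start_spinners(pids, 7), run the benchmark in the parent, then call stop_spinners() and compare the timings with a run made on an otherwise idle machine.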
andrew@alice.att.com (Andrew Hume) (12/25/90)
In article <9932@pasteur.Berkeley.EDU>, tve@sprite.berkeley.edu (Thorsten von Eicken) writes:
~ ... quick 2 cents worth of guesses:
~ you haven't said whether you're running your program on all 8 processors
~ or on only one of them. if you're running on only one, could it be that
~ the other seven interfere? What happens if you run a "for(;;);" program
~ on seven processors while running the benchmark on the eighth?
~ also, is there a cache-flush system call you can call before starting the
~ timer?
the program runs on just one cpu. the other processors are presumably
idle (or running some idle process). does cache-flush refer to file system?
if so, i don't see the need; my benchmark generates 200 bytes every run
(5 bytes/sec) and i'm sure one of the other 7 spare cpu's could handle
sending that one block off.
still puzzled,
andrew
p.s. how the hell do the specmark people do this stuff?
raytrace@cutmcvax.cs.curtin.edu.au (Phil Dench) (12/27/90)
andrew@alice.att.com (Andrew Hume) writes:
> I am running some benchmarks on a variety of machines
>and in particular, on an SGI 4D/380, a multiprocessor with 8
>33MHz R3000 cpus. my benchmark reads in about 1.1MB of text
>into an internal buffer and then runs cpu bound for about
>40s. total memory usage is <2MB; the machine's memory is 256MB.
>The benchmarks are run with the machine in single user (Unix)
>mode with normally mounted NFS filesystems unmounted. No other
>processes (except paging daemon etc) are running.
> my problem is that I see quite large variations over
>multiple runs of the same benchmark, sometimes as much as
>1.26%. Now, the resolution of the timer is .01s and i should see
>an accuracy of about .01/40 or .025%. I am a factor of 50 off this.
>does anyone know how i can run these benchmarks so as to get reproducible
>timings? (i note as an aside that just running the benchmarks on the cray
>in multi-user mode yields variations of the order of .15% which is
>satisfactory).
> andrew hume
> andrew@research.att.com

You LUCKY BASTARD! I dream of 256Mb 8 processor SG plus access to a Cray.
There's no pleasing some people :?)
--
Phil Dench                                  Andrew Marriott.
--------------------------------------------+----------------------------------
                                            | School of Computer Science,
ACSNet: raytrace@cutmcvax.cs.curtin.edu.au  | Curtin University of Technology,
UUCP: ...!uunet!munnari!cutmcvax!raytrace   | Kent Street,
ARPA: raytrace@cutmcvax.cs.curtin.edu.au    | Bentley
                                            | Western Australia, 6102
--------------------------------------------+----------------------------------
cprice@mips.COM (Charlie Price) (12/29/90)
In article <11737@alice.att.com> andrew@alice.att.com (Andrew Hume) writes:
>
> I am running some benchmarks on a variety of machines
>and in particular, on an SGI 4D/380, a multiprocessor with 8
>33MHz R3000 cpus.
...
> my problem is that I see quite large variations over
>multiple runs of the same benchmark, sometimes as much as
>1.26%. Now, the resolution of the timer is .01s and i should see
>an accuracy of about .01/40 or .025%. I am a factor of 50 off this.
>does anyone know how i can run these benchmarks so as to get reproducible
>timings? (i note as an aside that just running the benchmarks on the cray
>in multi-user mode yields variations of the order of .15% which is
>satisfactory).
>
> andrew hume
> andrew@research.att.com

One source of variability in benchmark times that nobody else has
mentioned (so I will) is cache conflicts.
Identical executions of a benchmark use the same *virtual* locations
in the same pattern, but these virtual locations get mapped to
physical locations, and in particular cache locations, in some
manner determined by the OS, previous activity on the machine,
the phase of the moon...
If successive executions of the program get different patterns
of cache conflict then you can easily see several percent
difference in the execution time.
This isn't just speculation.
In the early days at MIPS some maddening variability in execution times
was finally traced to variability in page allocation.
The execution variability mostly went away when the OS did page coloring
(matching the physical and virtual address of a page in certain ways)
to remove the cache-use variability.

I suspect that if the OS isn't giving you reproducible use of the
caches, you won't ever be able to get reproducible benchmark times.
--
Charlie Price cprice@mips.mips.com (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA 94086-23650
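[Editor's sketch] The mapping from physical pages to cache locations described above can be illustrated with a small calculation. This assumes a direct-mapped, physically indexed cache whose size is a multiple of the page size; the function names and the sizes in the usage note are illustrative assumptions, not the 4D/380's actual parameters or any OS's real interface.

```c
#include <assert.h>

/* Cache "color" of the physical page containing paddr.
   ASSUMPTION: direct-mapped, physically indexed cache whose size is a
   multiple of the page size. */
unsigned long page_color(unsigned long paddr,
                         unsigned long page_size,
                         unsigned long cache_size)
{
    assert(page_size > 0 && cache_size >= page_size);
    assert(cache_size % page_size == 0);
    unsigned long colors = cache_size / page_size;
    return (paddr / page_size) % colors;
}

/* Two physical pages compete for the same cache lines when their
   colors match. */
int pages_conflict(unsigned long pa, unsigned long pb,
                   unsigned long page_size, unsigned long cache_size)
{
    return page_color(pa, page_size, cache_size) ==
           page_color(pb, page_size, cache_size);
}
```

With 4 KB pages and a 64 KB cache there are 16 colors: physical addresses 0x00000 and 0x10000 both have color 0 and would evict each other's lines, while 0x11000 has color 1. If the OS hands out pages of different colors from run to run, the conflict pattern changes even though the virtual addresses do not.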
yohn@tumult.asd.sgi.com (Mike Thompson) (01/03/91)
In article <44383@mips.mips.COM>, cprice@mips.COM (Charlie Price) writes:
> In article <11737@alice.att.com> andrew@alice.att.com (Andrew Hume) writes:
> >
> > I am running some benchmarks on a variety of machines
> >and in particular, on an SGI 4D/380, a multiprocessor with 8
> >33MHz R3000 cpus.
> ...
> > my problem is that I see quite large variations over
> >multiple runs of the same benchmark, sometimes as much as
> >1.26%....
> >
> > andrew hume
> > andrew@research.att.com
>
> One source of variability in benchmark times that nobody else has
> mentioned (so I will) is cache conflicts.
> Identical executions of a benchmark use the same *virtual* locations
> in the same pattern, but these virtual locations get mapped to
> physical locations, and in particular cache locations, in some
> manner determined by the OS, previous activity on the machine,
> the phase of the moon...
> If subsequent executions of the program get different patterns
> of cache conflict then you can easily see several percent
> difference in the execution time due to differences in cache conflict.
> This isn't just speculation.
> In the early days at MIPS some maddening variability in execution times
> was finally traced to variability in page allocation.
> The execution variability mostly went away when the OS did page coloring
> (matching the physical and virtual address of a page in certain ways)
> to remove the cache-use variability.
>
> I suspect that if the OS isn't giving you reproducible use of the
> caches that you won't ever be able to get reproducible benchmark times.
> --
> Charlie Price cprice@mips.mips.com (408) 720-1700
> MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA 94086-23650

Good guess, but this is probably not the cause. The SGI OS (a.k.a. IRIX)
manages page/cache coloring. Only when memory is very tight is a
process likely to get pages that are not optimally cache-aligned.

Mike Thompson
yohn@sgi.com
Silicon Graphics Computer Systems
3ksnn64@cidmac.ecn.purdue.edu (Joe Cychosz) (01/03/91)
In article <11737@alice.att.com> andrew@alice.att.com (Andrew Hume) writes:
>
> I am running some benchmarks on a variety of machines
>and in particular, on an SGI 4D/380, a multiprocessor with 8
>33MHz R3000 cpus.
...
> my problem is that I see quite large variations over
>multiple runs of the same benchmark, sometimes as much as
>1.26%. Now, the resolution of the timer is .01s and i should see
>an accuracy of about .01/40 or .025%. I am a factor of 50 off this.
>does anyone know how i can run these benchmarks so as to get reproducible
>timings? (i note as an aside that just running the benchmarks on the cray
>in multi-user mode yields variations of the order of .15% which is
>satisfactory).

I am puzzled why you think the accuracy should be .01/40. Anyway, my
experience with the 60Hz and 100Hz clocks is that they are not exceptionally
accurate and are susceptible to process switching. Whichever process is
running when the tick arrives is charged for the whole interval, even though
it may have used only a small portion of that interval. The Cray uses a very
accurate clock which runs at the cycle time of the machine (~6ns for YMP).
I do not know if SGI has made the hi-res clock on the MIPS chip accessible yet.

This doesn't really solve your problem here, but it may come in handy
when running benchmarks.

/* ---- second - Return user CPU time in seconds. --------------------- */
/*                                                                      */
/*      Function to return the elapsed CPU time in seconds.  To         */
/*      properly compute the elapsed CPU time, the function should be   */
/*      called twice, once in the beginning to get the initial CPU      */
/*      time, and a second time to get the ending time.  The elapsed    */
/*      time is then the computed difference between the two times.     */
/*                                                                      */
/*      machine                 low-res   hi-res                        */
/*      Ardent Titan            100Hz     62.5ns                        */
/*      BBN Butterfly TC-2000   100Hz                                   */
/*      Convex C220             60Hz      1us                           */
/*      Cray 2, XMP, YMP                  ~6ns YMP                      */
/*      ETA 10                  100Hz     P 24ns, P* 21, Q 19, E 10.5   */
/*                                        F 8.5, G 7                    */
/*      Gould 9080 and NP1      60Hz                                    */
/*      Silicon Graphics 4D     100Hz                                   */
/*      Sun 3 and 4             60Hz                                    */
/*      Vax 11/780              60Hz                                    */
/*                                                                      */
/*      Author:                                                         */
/*          J. M. Cychosz 12/15/89.                                     */
/*          Purdue University CADLAB                                    */
/*                                                                      */
/* -------------------------------------------------------------------- */

#include <sys/types.h>
#include <sys/param.h>          /* HZ defined here              */
#include <sys/times.h>          /* used by times()              */
#include <sys/time.h>           /* used by getrusage()          */
#include <sys/resource.h>

#if ardent
extern double fputim ( double );
#endif

#if butterfly                   /* HZ not defined for BBN TC2000*/
#define HZ 100.0
#endif

#if eta10
#include <sys/syseta.h>
static int Pico_per_Cycle;      /* ETA10 ps per machine cycle   */
int asm q8ic7 () { tfc %18 }
#endif

#if gould                       /* HZ not defined for Gould 9080*/
#define HZ 60.0                 /* or NP1                       */
#endif

/* ---- second - Return user CPU time in seconds. --------------------- */

double second ()
{
#if ardent      /* Ardent Titan: use internal MIPS clock*/
    return ( (double) fputim (0.0) );
#define _CLOCK_
#endif

#if convex      /* Convex C220: use 1us clock */
    struct rusage res;
    getrusage (0,&res);
    return ( (double) (res.ru_exutime.tv_sec +
                       res.ru_exutime.tv_usec*1e-6) );
#define _CLOCK_
#endif

#if eta10       /* ETA 10: use real time clock */
    if (!Pico_per_Cycle)
        (void) syseta (ETA_PSPERCYCLE, &Pico_per_Cycle);
    return ( (double) q8ic7 () * (Pico_per_Cycle * 1e-12) );
#define _CLOCK_
#endif

#ifndef _CLOCK_ /* Default: use low-res times() clock */
    struct tms clock;
    times (&clock);
    /* HZ = 60.0 Convex, Sun, Gould, Vax */
    /* HZ = 100.0 SGI, Ardent, BBN Butterfly, ETA 10 */
    /* HZ = function(cycle_time) on Cray 2, XMP, and YMP */
    return ( (double) clock.tms_utime / HZ );
#endif
}
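[Editor's sketch] On a current POSIX system the default branch of the routine above might be written as follows. This is a sketch, not the poster's code: it swaps the compile-time HZ for sysconf(_SC_CLK_TCK), and user_seconds and time_user are hypothetical names.

```c
#include <assert.h>
#include <sys/times.h>
#include <unistd.h>

/* User CPU time of the calling process, in seconds.
   Resolution is one clock tick (1/100 s where the tick rate is 100). */
double user_seconds(void)
{
    struct tms t;
    long ticks_per_sec = sysconf(_SC_CLK_TCK);  /* run-time tick rate */

    assert(ticks_per_sec > 0);
    times(&t);
    return (double)t.tms_utime / (double)ticks_per_sec;
}

/* Time one call of work(); returns user CPU seconds consumed. */
double time_user(void (*work)(void))
{
    double start = user_seconds();
    work();
    return user_seconds() - start;
}
```

The call pattern is the one the header comment describes: read the clock before and after the work and subtract. Because the result is still counted in whole ticks, it inherits the tick-charging behaviour described in the post.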
tdonahue@bbn.com (Tim Donahue) (01/03/91)
In article <1991Jan3.035143.18865@noose.ecn.purdue.edu>, 3ksnn64@cidmac (Joe Cychosz) writes:
>In article <11737@alice.att.com> andrew@alice.att.com (Andrew Hume) writes:
>>
>> <stuff about timer resolution deleted>
>> ...
>
>/* ---- second - Return user CPU time in seconds. --------------------- */
>/* */
>/* Function to return the elapsed CPU time in seconds. To properly */
>/* compute the elapsed CPU time, the function should be called */
>/* twice, once in the beginning to get the initial CPU time, and */
>/* a second time to get the ending time. The elapsed time is then */
>/* the computed difference between the two times. */
>/* */
>/* */
>/* machine low-res hi-res */
>/* ...
>/* BBN Butterfly TC-2000 100Hz

The TC2000 does indeed include a high-resolution clock. This 32-bit
clock has a resolution of 1 microsecond and is synchronized by
hardware among all processors in the system. Call getusecclock()
under nX or pSOS+m to obtain the value of this clock.

The nX and pSOS operating systems also maintain a 64-bit software clock
whose low order bits are formed by the 32-bit microsecond hardware
clock. To obtain the value of this clock, use get64bitclock().

Finally, two microsecond-resolution interrupting timers are available
under pSOS+m. These timers compare the programmed value with the
value of the microsecond clock. When the microsecond clock reaches or
passes the programmed value, an interrupt is delivered to the CPU.

Cheers,
Tim
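[Editor's sketch] A 32-bit microsecond counter wraps roughly every 71.6 minutes (2^32 microseconds), which is presumably why a 64-bit software clock is kept on top of it. One common way to build such a clock is to count wraps, sketched below. This is an assumption about the general technique, not a description of how nX or pSOS+m implement get64bitclock(); struct clock64 and clock64_update are hypothetical names, and the hardware reading is passed in as a plain argument.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state for a software-extended clock. */
struct clock64 {
    uint32_t high;      /* number of wraps seen so far */
    uint32_t last_low;  /* most recent hardware reading */
};

/* Fold a new 32-bit microsecond reading into the 64-bit clock.
   ASSUMPTION: called at least once per wrap (about every 71.6 minutes),
   so at most one wrap can occur between calls. */
uint64_t clock64_update(struct clock64 *c, uint32_t low_now)
{
    assert(c != NULL);
    if (low_now < c->last_low)
        c->high++;              /* counter wrapped past 2^32 */
    c->last_low = low_now;
    return ((uint64_t)c->high << 32) | low_now;
}
```

Starting from a zeroed struct, a reading of 0xFFFFFFF0 followed by a reading of 0x10 yields 0x100000010 on the second call: the low word comes straight from the hardware value and the high word records the wrap.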