[comp.sys.sequent] timing parallel programs

beaumont@CompSci.Bristol.AC.UK (Tony Beaumont) (02/08/90)

We have a prototype OR-parallel Prolog system, written in C and running on a
Sequent Symmetry with 12 processors and a microsecond clock.

We want to get accurate timings of the run-times of Prolog programs when we
use 1, 2, ... 11 processors in order to accurately measure the speed-ups.
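
For reference, a minimal sketch of the speed-up arithmetic this implies,
assuming the usual definitions S(p) = T(1)/T(p) and E(p) = S(p)/p.  The
function names are illustrative only and do not come from the Prolog
system.

```c
/* Speed-up on p processors relative to the one-processor run:
 * S(p) = T(1) / T(p).  Returns 0.0 if t_p is not positive. */
double speedup(double t_1, double t_p)
{
    return t_p > 0.0 ? t_1 / t_p : 0.0;
}

/* Parallel efficiency: E(p) = S(p) / p.  Returns 0.0 if p < 1. */
double efficiency(double t_1, double t_p, int p)
{
    return p > 0 ? speedup(t_1, t_p) / p : 0.0;
}
```

A run that takes 12 seconds on one processor and 3 seconds on six gives
a speed-up of 4 and an efficiency of about 0.67.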

Our sequent is also running X-windows and there are 4 or 5 X-terminals
attached.

THE PROBLEM...
1.  How can we be sure that a program using (say) 6 DYNIX processes will
be run on 6 processors without interruption by other processes (i.e.
operating system processes, X-windows processes etc.)?

Is there a way to get a processor to exclusively run a process?

2.  Our current solution is to ask other users to log off when we make
timing runs, ensuring the load on the machine is as low as possible.
However, operating system processes are still running and although we
always leave at least 1 processor idle, how can we be sure that the
operating system processes do not interfere with (slow down) the processes
we are timing?  Also this approach is rather unfortunate in that we
require all other users to log off which in effect means that we have to
make our timings during the night.

email replies directly to me and if there is any interest I'll post a
summary of responses.

-Tony Beaumont
email (JANET): beaumont@uk.ac.bristol.compsci
Post: Department of Computer Science
      University of Bristol
      Bristol BS8 1TR
      UK

cudep@warwick.ac.uk (Ian Dickinson) (02/12/90)

In article <1324@csisles.Bristol.AC.UK> beaumont@CompSci.Bristol.AC.UK (Tony Beaumont) writes:
>Is there a way to get a processor to exclusively run a process?

Yes, if you are running the code setuid.
There is a system call named tmp_affinity or similar
(try 'man -k affinity' and you should get something in section 2).
This call will make a process have exclusive use of a single processor.
If you make all the parallel processes call this,
then spin until they are synchronised,
you should be able to conduct reasonable tests
without any other process intervening.
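
A minimal sketch of the "spin until they are synchronised" step, under
some loud assumptions: it uses POSIX threads and C11 atomics as a
stand-in for DYNIX processes sharing memory, and it does not call
tmp_affinity at all, since the call's signature is not given in this
thread.  The names run_synchronised, worker_main and MAX_WORKERS are
made up for the example.

```c
#include <pthread.h>
#include <stdatomic.h>

#define MAX_WORKERS 16              /* arbitrary limit for this sketch */

static atomic_int arrived;          /* workers that reached the start line */

struct worker {
    int nworkers;                   /* how many workers must arrive */
    int released;                   /* set once this worker passed the spin */
};

static void *worker_main(void *arg)
{
    struct worker *w = arg;

    /* A DYNIX program would bind itself to a processor here (the
     * affinity call named above; deliberately not shown). */
    atomic_fetch_add(&arrived, 1);
    while (atomic_load(&arrived) < w->nworkers)
        ;                           /* spin until everyone has arrived */

    w->released = 1;                /* the timed region would start here */
    return NULL;
}

/* Start nworkers threads, release them together, and return how many
 * got past the barrier, or -1 on a bad argument or a failed create. */
int run_synchronised(int nworkers)
{
    pthread_t tid[MAX_WORKERS];
    struct worker w[MAX_WORKERS];
    int i, started, released = 0;

    if (nworkers < 1 || nworkers > MAX_WORKERS)
        return -1;

    atomic_store(&arrived, 0);
    for (i = 0; i < nworkers; i++) {
        w[i].nworkers = nworkers;
        w[i].released = 0;
        if (pthread_create(&tid[i], NULL, worker_main, &w[i]) != 0)
            break;
    }
    started = i;

    /* If a create failed, stand in for the missing workers so the ones
     * already spinning are not stuck forever. */
    if (started < nworkers)
        atomic_fetch_add(&arrived, nworkers - started);

    for (i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);
        released += w[i].released;
    }
    return started == nworkers ? released : -1;
}
```

Spinning only makes sense when each worker really has a processor to
itself; with more workers than processors the spin just burns time
slices until the last worker is scheduled.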

It might still be worthwhile running the tests during light load periods
so bus congestion, paging etc are less of a problem.

Hokay!
-- 
\/ato.  vato@uk.ac.warwick.  *NIX gives good head.  Support the FSF.  Plinth.
"When it's a smoking charred stump, that's too much.
                       Mine hasn't charred yet."    - entropy@alembic.acs.com

jim@cs.strath.ac.uk (Jim Reid) (02/12/90)

In article <1324@csisles.Bristol.AC.UK> beaumont@CompSci.Bristol.AC.UK (Tony Beaumont) writes:
>We want to get accurate timings of the run-times of Prolog programs when we
>use 1, 2, ... 11 processors in order to accurately measure the speed-ups.

> Our current solution is to ask other users to log off when we make
>timing runs, ensuring the load on the machine is as low as possible.
>However, operating system processes are still running and although we
>always leave at least 1 processor idle, how can we be sure that the
>operating system processes do not interfere with (slow down) the processes
>we are timing?  Also this approach is rather unfortunate in that we
>require all other users to log off which in effect means that we have to
>make our timings during the night.

If you are hell-bent on timing *only* your application, you have little
choice but to dedicate your machine to that task. That means kicking off
all the users and booting the machine with a minimal number of processes
active - essentially in single user mode. You wouldn't want the results
to be affected by a tty interrupt or a packet arriving from the ethernet
or whatever. This will give you accurate results but is somewhat
extreme.

Note that there will always be some kernel latency. [This is because of
the way the UNIX kernel works.] Your application process may have to
switch into kernel mode to do some interrupt handling, so unless you can
disable the interrupts, your results will be affected by the demands of
interrupt servicing.

Do you *really* need that accuracy? Most people treat benchmarks with
deserved scepticism so your super-accurate results are only likely to be
interpreted as a rough rule of thumb. If that's all you're looking for,
you might as well run the timings while the machine is in use.
Personally speaking, I'd be happier if I knew that timings included the
noise of other system activity. I don't run applications on an empty
machine, so why should I run benchmarks in that way? To me, the absolute
numbers are not important - it's more useful to know what to expect when
the system's doing real work...

		Jim

luis@octopus.tds.kth.se (Luis Barriga) (02/15/90)

>We want to get accurate timings of the run-times of Prolog programs when we
>use 1, 2, ... 11 processors in order to accurately measure the speed-ups.
I can think of two more ways of solving this problem.

1) You can use the "at" facility to schedule a job at any time of any
day, when hopefully nobody is logged into your Sequent (maybe Friday
or Saturday night). This works if your application does not take days
to finish and needs no interaction with the user. If the input is
always the same, you can prepare it in a file and redirect standard
input from it.

2) I have read about a system call "getrusage" that gives info about
resource utilization: system, user time consumed, and other stuff. Is
there any problem using it?
--
________________________________________________________________________|
   Luis Barriga			     The Royal Institute of Technology	|
				     Dep. Computer systems (TDS)	|
   e-address: luis@tds.kth.se	     S-100 44 Stockholm			|
				       SWEDEN     			|
________________________________________________________________________|

jim@cs.strath.ac.uk (Jim Reid) (02/15/90)

In article <LUIS.90Feb15104438@molly.octopus.tds.kth.se> luis@octopus.tds.kth.se (Luis Barriga) writes:
>2) I have read about a system call "getrusage" that gives info about
>resource utilization: system, user time consumed, and other stuff. Is
>there any problem using it?

No, but the numbers it gives may not be reproducible. Paging statistics
may be influenced by the amount of free memory when the process is run.
The system time will probably include time spent processing interrupts
*on behalf of other processes*. Likewise the I/O statistics may include
counts for I/O for another process (i.e. starting the next disk transfer
request in the queue after servicing an interrupt from the controller).
If you plan on using getrusage for real, run your program several times
to average out these potential inconsistencies.
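
A minimal sketch of that advice, assuming a POSIX-style getrusage():
read the user and system times before and after each run, repeat, and
report both the mean and the minimum.  The helper names and the
sample_work() workload are invented for illustration.

```c
#include <stddef.h>
#include <sys/time.h>
#include <sys/resource.h>

/* User + system CPU time consumed so far by this process, in seconds. */
double cpu_seconds(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1.0;
    return (double)(ru.ru_utime.tv_sec + ru.ru_stime.tv_sec)
         + (ru.ru_utime.tv_usec + ru.ru_stime.tv_usec) / 1e6;
}

/* Run work() `runs` times.  Returns the mean CPU seconds per run and
 * stores the smallest single-run figure in *min_out (if not NULL).
 * Returns -1.0 on a bad argument. */
double mean_cpu_seconds(void (*work)(void), int runs, double *min_out)
{
    double total = 0.0, best = 0.0;
    int i;

    if (work == NULL || runs < 1)
        return -1.0;

    for (i = 0; i < runs; i++) {
        double before = cpu_seconds();
        double used;

        work();
        used = cpu_seconds() - before;
        total += used;
        if (i == 0 || used < best)
            best = used;
    }
    if (min_out != NULL)
        *min_out = best;
    return total / runs;
}

/* Hypothetical workload, only here so the sketch can be exercised. */
static volatile unsigned long sink;

void sample_work(void)
{
    unsigned long i;

    for (i = 0; i < 1000000UL; i++)
        sink += i;
}
```

A large gap between the mean and the minimum is a hint that some runs
picked up time that was not really theirs.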

		Jim

peralta@pinocchio.Encore.COM (Rick Peralta) (02/16/90)

In article luis@octopus.tds.kth.se (Luis Barriga) writes:
>
>2) I have read about a system call "getrusage" that gives info about
>resource utilization: system, user time consumed, and other stuff. Is
>there any problem using it?

It should be fine, just as long as you peek at it at the beginning and
end of your timing runs.  Otherwise you will get a little startup cost
into the overall measurements.  BTW: you'll need to sync the starts.

If you are interested in just general trends, just use the wall clock
time and test 0, 1 and many iterations of the code path.  That way you
can estimate fairly accurately how long things take, without going to
tremendous effort coding out the startup costs.
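
One reading of the 0, 1 and many scheme, as a sketch: the 0-iteration
run gives the startup cost, the difference between the 1 and 0 runs
gives the cost of the first pass, and the remaining passes give the
steady-state cost per iteration.  The struct and function names are
made up for the example.

```c
#include <stddef.h>

struct cost_estimate {
    double startup;     /* wall time with no iterations at all */
    double first_iter;  /* extra time for the first iteration */
    double per_iter;    /* average time of each later iteration */
};

/* t0, t1 and tn are wall-clock times for runs of 0, 1 and n iterations.
 * Needs n >= 2.  Returns 0 on success, -1 on a bad argument. */
int estimate_costs(double t0, double t1, double tn, int n,
                   struct cost_estimate *out)
{
    if (n < 2 || out == NULL)
        return -1;

    out->startup    = t0;
    out->first_iter = t1 - t0;
    out->per_iter   = (tn - t1) / (n - 1);
    return 0;
}
```

With t0 = 2.0 s, t1 = 2.5 s and tn = 12.0 s for n = 20, the estimate is
2.0 s of startup and 0.5 s per iteration.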

 - Rick

david@torsqnt.UUCP (David Haynes) (02/16/90)

peralta@pinocchio.Encore.COM (Rick Peralta) writes:

>In article luis@octopus.tds.kth.se (Luis Barriga) writes:
>>
>>2) I have read about a system call "getrusage" that gives info about
>>resource utilization: system, user time consumed, and other stuff. Is
>>there any problem using it?

>It should be fine, just as long as you peek at it at the beginning and
>end of your timing runs.  Otherwise you will get a little startup cost
>into the overall measurements.  BTW: you'll need to sync the starts.

There was some work done at the University of Western Ontario utilizing
a program called "gun" which provided synchronized starts. The basic
strategy, as it was explained to me, involved having the processes watch
a memory location to see a change of state. 

-david-
-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
David Haynes			Sequent Computer Systems (Canada) Ltd.
"...and this is true, which is unusual for marketing." -- RG May 1989
...!{utgpu, yunexus, utai}!torsqnt!david -or- david@torsqnt.UUCP

eugene@eos.UUCP (Eugene Miya) (02/16/90)

In article <11171@encore.Encore.COM> peralta@multimax.encore.com (Rick Peralta) writes:
>BTW: you'll need to sync the starts.

Very much so.  I wrote a paper on this in the 88 Usenix Supercomputing
Workshop.

--eugene

carl@aerospace.aero.org (Carl Kesselman) (02/24/90)

No, no no. The responses being sent are wrong.  How about RTFM.  There
is a facility called tmp_affinity.  You can use this to ensure that a
process runs on a specific processor.  If you have the appropriate
kernel configuration, this can be done without root permission.  You
might also want to disable swapping and pff reduction of the working
set using the vm_ctl system call.  This does not completely free you
from interaction with other programs, but it is pretty good.  In
particular, you still must share the system bus and disk controllers
with other processes.  In addition, you should always leave a
processor or two free to handle interrupts, network traffic and the
like.  NOTE: there is about a 20 microsecond overhead if you read the
clock through the function call interface described in the manual
page.  There is also a set of macros defined that can allow you to
access the clock in about 6 usecs.


Carl

jim@cs.strath.ac.uk (Jim Reid) (02/26/90)

In article <67411@aerospace.AERO.ORG> carl@altair.UUCP (Carl Kesselman) writes:
>No, no no. The responses being sent are wrong.  How about RTFM.  There
>is a facility called tmp_affinity.  You can use this to ensure that a
>process runs on a specific processor.

Yes, but that is not much help to the person who posed the initial
question. They were looking for a way to *exclusively* dedicate a
processor (or bunch of processors) to a particular process. The
tmp_affinity() system call does not provide this facility. It can bind
a process to a particular processor, but does not prevent another
process from also being bound to that processor or for the processor to
be given over to interrupt servicing and related kernel processing. If
the processor had to switch from the process that was being timed, the
benchmark results would be inconsistent and possibly non-reproducible.
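
Where getrusage() fills in the BSD ru_nivcsw field, a run can at least
report whether it was switched out.  A minimal sketch, with invented
names, that counts involuntary context switches across a piece of
work.  It does not see processor time taken by interrupt handlers, only
the process being preempted.

```c
#include <stddef.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Run work() and return how many involuntary context switches this
 * process suffered while it ran, or -1 on error. */
long preemptions_during(void (*work)(void))
{
    struct rusage before, after;

    if (work == NULL)
        return -1;
    if (getrusage(RUSAGE_SELF, &before) != 0)
        return -1;
    work();
    if (getrusage(RUSAGE_SELF, &after) != 0)
        return -1;
    return after.ru_nivcsw - before.ru_nivcsw;
}

/* Hypothetical workload, only here so the sketch can be exercised. */
static volatile unsigned long sink;

void busy_work(void)
{
    unsigned long i;

    for (i = 0; i < 1000000UL; i++)
        sink += i;
}
```

A timing run that reports a non-zero count could be discarded and
repeated rather than averaged in.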

I understand that Encore provide this capability on their boxes, so maybe
DYNIX will get it some day.

		Jim

bakken@cs.arizona.edu (Dave Bakken) (02/27/90)

In article <2151@baird.cs.strath.ac.uk> jim@cs.strath.ac.uk writes:
>Yes, but that is not much help to the person who posed the initial
>question. They were looking for a way to *exclusively* dedicate a
>processor (or bunch of processors) to a particular process. The
>tmp_affinity() system call does not provide this facility. It can bind
>a process to a particular processor, but does not prevent another
>process from also being bound to that processor or for the processor to
>be given over to interrupt servicing and related kernel processing. If
>the processor had to switch from the process that was being timed, the
>benchmark results would be inconsistent and possibly non-reproducible.
 
>I understand that Encore provide this capability on their boxes, so maybe
>DYNIX will get it some day.

I hope so, but I won't hold my breath, since Sequent seems to be about
2-3 years behind every other multiprocessor vendor in terms of their
OS.  I think that, in general, it would be nice to have a (privileged)
system call that allowed you to dedicate a certain number of processors
to a certain job without rebooting or anything.  Even more generally, I
think it would be useful to a lot of people if you could create separate
pools of processors, where processors in the same pool pulled processes off
of the same ready queue, and where you can configure which jobs go to
which pool in a flexible way, maybe based on some sort of grouping (e.g., 
faculty, grad_student, undergrad).  This grouping could probably be both
static and dynamic.
-- 
Dave Bakken				Internet:  bakken@cs.arizona.edu
721 Gould-Simpson Bldg			UUCP:	   uunet!arizona!bakken
Dept of Computer Science; U of Arizona 	Phone:	   +1 602 621 8372 (w)
Tucson, AZ 85721   USA			FAX:	   +1 602 621 4246

chowe@bbn.com (Carl Howe) (02/28/90)

In article <2151@baird.cs.strath.ac.uk> jim@cs.strath.ac.uk writes:
>In article <67411@aerospace.AERO.ORG> carl@altair.UUCP (Carl Kesselman) writes:
>>No, no no. The responses being sent are wrong.  How about RTFM.  There
>>is a facility called tmp_affinity.  You can use this to ensure that a
>>process runs on a specific processor.
>
>Yes, but that is not much help to the person who posed the initial
>question. They were looking for a way to *exclusively* dedicate a
>processor (or bunch of processors) to a particular process. The
>tmp_affinity() system call does not provide this facility. It can bind
>a process to a particular processor, but does not prevent another
>process from also being bound to that processor or for the processor to
>be given over to interrupt servicing and related kernel processing. If
>the processor had to switch from the process that was being timed, the
>benchmark results would be inconsistent and possibly non-reproducible.
>
>I understand that Encore provide this capability on their boxes, so maybe
>DYNIX will get it some day.
>

I apologize in advance if this is too far off the topic of Sequent
computers, but since you already brought up one other vendor,
I thought I'd just mention an implementation of both capabilities.
The facility that you both are describing exists in
nX, the UNIX OS for the BBN Advanced Computers' multiprocessors.
Affiliating a process with a processor
is done via a "fork_and_bind" system call.  Dedication of processors
to processes is done via a facility called "clusters", which are dynamically
created and destroyed.  When users originally log in, they work in a
public cluster of processors, but if they want a dedicated set of processors
to run a program, they can use the command,

	cluster <#processors> <command> <args>

to get such a cluster.  The processors for this cluster are allocated out
of a free pool and will be returned to the pool after the execution of
the command.  We use this type of facility all the time for
benchmarking and characterizing multiprocessor programs without disrupting
the work of other users.  No special privileges are required.

There are a few descriptions of this facility in the literature if
you are interested in more details.  I apologize in advance that I can't
quote them here because my office is currently packed in boxes.  Email me
if you would like them when I get unpacked.  If people would like to discuss
or post regarding this further, we should probably move to comp.parallel.

Carl
chowe@bbn.com.