[comp.sys.encore] Timing on the Multimax

finley@iris.ucdavis.edu (Curtis M. Finley) (11/12/88)

I'm studying the performance of a program which makes use of the 
multitasking library.  (I realize threads is supposed to be better 
than multitasking but I started the project before threads was 
available.)  I'm using /bin/time to look at real, user, and system 
times with various numbers of processes at various levels of work.  
I'm having trouble explaining some of the times.  I'd appreciate it if 
you could give me some help. 

What goes into "system time"?  Do calls to the multitasking library 
accumulate system time or user time?  It would appear that time spent 
in task_init accumulates system time.  What about spin locks, 
task_start, task_stop, share_malloc, and task_self?  What about I/O 
for the management of virtual memory?  I assume that instructions which 
trigger paging I/O accumulate system time.  Is the time for disk access 
and data transfer charged to system time, or does it account for part 
of the discrepancy between real time and the sum of system and user 
time? 
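
For reference, here is roughly how I have been cross-checking /bin/time 
from inside the program.  This is only a sketch, and it assumes the 
4.2BSD-style getrusage() call is available under UMAX; the call and its 
rusage structure are an assumption on my part, not something I found in 
the Encore documentation.

#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Print the user and system time this process has consumed so far. */
static void print_times(char *label)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0) {
        perror("getrusage");
        return;
    }
    printf("%s: user %ld.%06ld s, system %ld.%06ld s\n", label,
           (long) ru.ru_utime.tv_sec, (long) ru.ru_utime.tv_usec,
           (long) ru.ru_stime.tv_sec, (long) ru.ru_stime.tv_usec);
}

int main(void)
{
    print_times("before setup");
    /* task_init and the rest of the real program would go here. */
    print_times("after setup");
    return 0;
}

Bracketing task_init with calls like these would at least show whether 
its start-up cost lands in user time or system time.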

My experience is that task_init can take a tremendous amount of system 
time when three or more processes are being started.  The amount of 
system time is directly related to the amount of shared memory being 
reserved.  Why does the amount of memory make a difference?  Since the 
memory is virtual, it would seem that it should have no effect until it 
is actually accessed. 
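
My mental model of demand paging is the sketch below (the 35MB figure 
matches my program; the 4096-byte stride is just an assumed page size).  
Reserving the region ought to be nearly free, with cost appearing only 
for pages that are actually written:

#include <stdlib.h>

#define RESERVE (35L * 1024 * 1024)   /* bytes reserved           */
#define TOUCHED (1L * 1024 * 1024)    /* bytes actually written   */
#define STRIDE  4096                  /* assumed page size        */

int main(void)
{
    char *p = malloc(RESERVE);
    long i;

    if (p == NULL)
        return 1;
    /* Touch one byte per page in a small part of the region; only
       these pages should ever need to be faulted in.             */
    for (i = 0; i < TOUCHED; i += STRIDE)
        p[i] = 0;
    return 0;
}

If task_init does more than a plain malloc here (for example, if it 
zero-fills or wires down the whole shared region up front), that might 
explain why the reservation size shows up as system time, but that is 
pure speculation on my part.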

I wrote a concurrent program using multitasking and ran the program 
several times varying the amount of work done in the concurrent tasks.  
The program reserves 35MB shared memory but doesn't use most of it.  A 
plot of work vs system time shows a "staircase" effect.  As more work 
is done, system time increases then levels off, increases then levels 
off, increases then levels off ....  The increases are about 0.4 to 
0.6 seconds.  The stepping occurs more when there are a large number 
of processes (20) and less when there are fewer (4).  Stepping is less 
pronounced when only 3MB shared memory is reserved.  One thought I had 
is that the step up is due to time required to keep track of memory 
for multiple processes at the time of a context switch.  This argument 
is flawed since we have six processors and the stepping still occurs 
when the number of processes is less than six.  Timings were done 
while the machine was practically idle and there should not be any 
context switches.  

Does UMAX handle processes started with task_init as a unit?  That is, 
are they time-sliced simultaneously or are they completely independent?  
Presumably they are independent, but if they are not, perhaps that could 
help explain the stepping described above. 

My timing studies indicate that programs using multitasking have to do 
a tremendous amount of parallel computation to outperform programs 
with a single thread of control.  This is mainly due to the large 
system overhead of starting the multitasking environment.  Is threads 
significantly better in this respect?

anand@amax.npac.syr.edu (Anand Rangachari) (11/12/88)

In article <3273@ucdavis.ucdavis.edu> finley@iris.ucdavis.edu (Curtis M. Finley) writes:
>I wrote a concurrent program using multitasking and ran the program
>several times varying the amount of work done in the concurrent tasks.
>The program reserves 35MB shared memory but doesn't use most of it.  A
>plot of work vs system time shows a "staircase" effect.  As more work
>is done, system time increases then levels off, increases then levels
>off, increases then levels off ....  The increases are about 0.4 to
>0.6 seconds.  The stepping occurs more when there are a large number
>of processes (20) and less when there are fewer (4).  Stepping is less
>pronounced when only 3MB shared memory is reserved.  One thought I had
>is that the step up is due to time required to keep track of memory
>for multiple processes at the time of a context switch.  This argument
>is flawed since we have six processors and the stepping still occurs
>when the number of processes is less than six.  Timings were done
>while the machine was practically idle and there should not be any
>context switches.

  Amazing, I thought that I was the only one who had a staircase
effect. My program is a kernel for a concurrent language. The
speedup is shown below.

Note: Speedup := T(1) / T(n)  where n is the number of processes.

Speed |
up    |
      |
      |
      |
      |
      |
3     |            *  *
      |
2     |      *  *
      |
1     |*  *
      +------------------------------------
       1  2  3  4  5  6     Number of processes

  The funny thing is that our Multimax has APCs, in which each
processor has its own cache, which makes the effect even more mystifying.
The effect is most noticeable in programs that need to make frequent
references to main memory (i.e., when the cache is not effective).
  The effect occurs all the way up to 18 processes (we have 18
processors), and there were no other users on the system.
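
  By "frequent references to the main memory" I mean code along the
lines of the sketch below: walking a buffer much larger than the
per-processor cache with a stride of at least one cache line, so that
nearly every reference goes out over the bus.  The sizes are made up
for illustration; our real kernel is of course more involved.

#include <stdlib.h>

#define N      (4L * 1024 * 1024)   /* larger than the cache      */
#define STRIDE 128                  /* assumed >= one cache line  */

/* Sum one byte per cache line across the whole buffer; with these
   parameters almost every access should miss the cache.          */
long touch_memory(char *buf)
{
    long i, sum = 0;

    for (i = 0; i < N; i += STRIDE)
        sum += buf[i];
    return sum;
}

int main(void)
{
    char *buf = calloc(N, 1);

    if (buf == NULL)
        return 1;
    touch_memory(buf);
    return 0;
}

When several processors run loops like this at once they all contend
for the shared bus and memory, which seems like one plausible source
of the steps.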

>
>Does UMAX handle processes started with task_init as a unit?  That is,
>are they time-sliced simultaneously or are they completely independent?
>Presumably they are independent, but if they are not, perhaps that could
>help explain the stepping described above.

  This brings up another question about Umax. I know that on a
single-processor computer running Unix, the CPU is interrupted at
regular intervals by the timer to allow time-slice scheduling. The
question then is: is each processor interrupted every 200 msec for
scheduling, or is just one processor interrupted? If the latter were
true, then on our 18-processor system each processor would be
interrupted only once every 3.6 seconds.

>
>My timing studies indicate that programs using multitasking have to do
>a tremendous amount of parallel computation to outperform programs
>with a single thread of control.  This is mainly due to the large
>system overhead of starting the multitasking environment.   Is threads
>significantly better in this respect?

  This is what we have been studying for the last year while developing our
concurrent language. The speedup of your program depends on how much
time is wasted. On the Multimax, there are two ways that time can be
wasted:

  1. Waiting for references from main memory
  2. Waiting on locks.

You can minimize the first one by making your threads refer only to very small
segments of memory. The second can be minimized by careful program design.
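
As an illustration of the second point, the pattern we aim for is the
one sketched below: each thread works entirely on private data and
takes the shared lock exactly once, for the final combining step.
(spin_lock, spin_unlock and the argument layout are hypothetical names,
not necessarily those of the supplied multitasking package.)

/* Hypothetical lock primitives; substitute whatever your package
   actually provides.                                              */
extern void spin_lock(int *lock);
extern void spin_unlock(int *lock);

/* Sum one slice of a shared array.  The loop touches only private
   data (this task's slice and a local accumulator); the shared lock
   is taken once per task rather than once per element.            */
void sum_slice(double *data, int first, int last,
               int *lock, double *total)
{
    double local = 0.0;
    int i;

    for (i = first; i < last; i++)     /* no locking in the loop */
        local += data[i];

    spin_lock(lock);
    *total += local;                   /* one brief critical section */
    spin_unlock(lock);
}

The same structure helps with the first point as well, since the inner
loop touches only a contiguous slice and one local variable and so
stays friendly to the per-processor cache.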

I have not measured how long it takes to start a thread in the supplied
package, but in our language it takes about 200 microseconds.
If a thread communicates with another thread no more than once a millisecond,
your program will show a nearly linear speedup.
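
To put the numbers together: at 200 microseconds per start, a thread
needs only about 20 milliseconds of useful work before its start-up
cost drops below one percent of its running time (0.0002 / 0.020 = 1%).
The once-per-millisecond rule is the same sort of amortization argument
applied to each interaction between threads.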


                                          R. Anand
 Internet: anand@amax.npac.syr.edu
 Bitnet:   ranand@sunrise

jeff@Alliant.COM (Jeff Collins) (11/15/88)

In article <832@cmx.npac.syr.edu> anand@amax.npac.syr.edu (Anand Rangachari) writes:
>  This brings up another question about Umax. I know that on a
>single-processor computer running Unix, the CPU is interrupted at
>regular intervals by the timer to allow time-slice scheduling. The
>question then is: is each processor interrupted every 200 msec for
>scheduling, or is just one processor interrupted? If the latter were
>true, then on our 18-processor system each processor would be
>interrupted only once every 3.6 seconds.
>
	Only one processor is interrupted by the system clock.  The
system clock generates what is referred to as an arbitrated interrupt.
Arbitrated interrupts are handled by the "most eligible" processor.
This is determined by the hardware based on the number of other
interrupts being handled and some load-leveling magic (I never did
understand all of the variables used in this arbitration).  This load
leveling ensures that one CPU is not bombarded with interrupts.  So, on
average, as you said, each CPU only handles one clock interrupt every
3.6 seconds.

corbin@pinocchio.Encore.COM (Steve Corbin) (11/18/88)

 In article <2623@alliant.Alliant.COM> jeff@alliant.Alliant.COM (Jeff Collins) writes:
 >In article <832@cmx.npac.syr.edu> anand@amax.npac.syr.edu (Anand Rangachari) writes:
 >>  This brings up another question about Umax. I know that on a
 >>single-processor computer running Unix, the CPU is interrupted at
 >>regular intervals by the timer to allow time-slice scheduling. The
 >>question then is: is each processor interrupted every 200 msec for
 >>scheduling, or is just one processor interrupted? If the latter were
 >>true, then on our 18-processor system each processor would be
 >>interrupted only once every 3.6 seconds.
 >>
 >	Only one processor is interrupted by the system clock.  The
 >system clock generates what is referred to as an arbitrated interrupt.
 >Arbitrated interrupts are handled by the "most eligible" processor.
 >This is determined by the hardware based on the number of other
 >interrupts being handled and some load-leveling magic (I never did
 >understand all of the variables used in this arbitration).  This load
 >leveling ensures that one CPU is not bombarded with interrupts.  So, on
 >average, as you said, each CPU only handles one clock interrupt every
 >3.6 seconds.

The Multimax has both a system clock and a per-processor timer.
The system clock is used for maintaining the time of day, handling timed
queues, etc.  The system clock interrupt occurs at the HZ rate set in the
boot parameter file.  These interrupts are "pooled" or "arbitrated"
as described above, so that all CPUs in the system share the processing
burden.

The per-processor timers are used for preemptive time slicing.
If a process does not relinquish the CPU within its time-slice limit,
then a TSE interrupt is taken, forcing a reschedule of that process.
The current time-slice limit is 200 ms.  Thus a CPU may receive a TSE
interrupt every 200 ms (CPU bound) or never get one at all (either
I/O bound or making lots of system calls).  It depends on the
application code.
Stephen Corbin	    UUCP:	{bu-cs,decvax,necntc,talcott}!encore!corbin
		    Internet:	corbin@multimax.ARPA