finley@iris.ucdavis.edu (Curtis M. Finley) (11/12/88)
I'm studying the performance of a program which makes use of the multitasking library. (I realize threads is supposed to be better than multitasking, but I started the project before threads was available.) I'm using /bin/time to look at real, user, and system times with various numbers of processes at various levels of work. I'm having trouble explaining some of the times, and I'd appreciate some help.

What goes into "system time"? Do calls to the multitasking library accumulate system time or user time? It would appear that time spent in task_init accumulates system time. What about spin locks, task_start, task_stop, share_malloc, and task_self? What about I/O for the management of virtual memory? I assume the instructions which cause I/O for virtual memory cause an accumulation of system time. Is the time for disk access and data transfer accumulated under system time, or does it account for part of the discrepancy between real time and the sum of system and user time?

My experience is that task_init can take a tremendous amount of system time when three or more processes are being started. The amount of system time is directly related to the amount of shared memory being reserved. Why does the amount of memory make a difference? It would seem that the memory is virtual and that it should have no effect until accessed.

I wrote a concurrent program using multitasking and ran the program several times, varying the amount of work done in the concurrent tasks. The program reserves 35MB of shared memory but doesn't use most of it. A plot of work vs. system time shows a "staircase" effect: as more work is done, system time increases then levels off, increases then levels off, and so on. The increases are about 0.4 to 0.6 seconds. The stepping occurs more when there are a large number of processes (20) and less when there are fewer (4). Stepping is less pronounced when only 3MB of shared memory is reserved.
One thought I had is that the step up is due to the time required to keep track of memory for multiple processes at the time of a context switch. This argument is flawed, since we have six processors and the stepping still occurs when the number of processes is less than six. Timings were done while the machine was practically idle, so there should not be any context switches.

Does UMAX handle processes started with task_init as a unit? That is, are they time-sliced simultaneously, or are they completely independent? Presumably they are independent, but if they are not, perhaps that could help explain the stepping described above.

My timing studies indicate that programs using multitasking have to do a tremendous amount of parallel computation to outperform programs with a single thread of control. This is mainly due to the large system overhead of starting the multitasking environment. Is threads significantly better in this respect?
anand@amax.npac.syr.edu (Anand Rangachari) (11/12/88)
In article <3273@ucdavis.ucdavis.edu> finley@iris.ucdavis.edu (Curtis M. Finley) writes:
>I wrote a concurrent program using multitasking and ran the program
>several times varying the amount of work done in the concurrent tasks.
>The program reserves 35MB shared memory but doesn't use most of it. A
>plot of work vs system time shows a "staircase" effect. As more work
>is done, system time increases then levels off, increases then levels
>off, increases then levels off .... The increases are about 0.4 to
>0.6 seconds. The stepping occurs more when there are a large number
>of processes (20) and less when there are fewer (4). Stepping is less
>pronounced when only 3MB shared memory is reserved. One thought I had
>is that the step up is due to time required to keep track of memory
>for multiple processes at the time of a context switch. This argument
>is flawed since we have six processors and the stepping still occurs
>when the number of processes is less than six. Timings were done
>while the machine was practically idle and there should not be any
>context switches.

Amazing, I thought that I was the only one who had a staircase effect. My program is a kernel for a concurrent language. The speedup is shown below. Note: speedup := T(1) / T(n), where n is the number of processes.

    Speed
    up
      |
    3 |                     *     *
    2 |         *     *
    1 | *     *
      +------------------------------------
        1     2     3     4     5     6
                Number of processes

The funny thing is that our Multimax has APCs, in which each processor has its own cache, which makes it even more mystifying. This effect is most noticeable in programs that need to make frequent references to main memory (i.e., the cache is not effective). The effect occurs all the way up to 18 processes (we have 18 processors). Also, there were no other users on the system.

>Does UMAX handle processes started with task_init as a unit? That is,
>are they time-sliced simultaneously or are they completely independent?
>Presumably they are independent but if they are not perhaps that could
>help explain the stepping described above.

This brings up another question about Umax. I know that on a single-processor computer running UNIX, the CPU is interrupted at regular intervals by the timer to allow time-slice scheduling. The question then is: is each processor interrupted every 200 msec for scheduling, or is just one processor interrupted? If the latter were true, then on our 18-processor system each processor would be interrupted only every 3.6 seconds.

>My timing studies indicate that programs using multitasking have to do
>a tremendous amount of parallel computation to outperform programs
>with a single thread of control. This is mainly due to the large
>system overhead of starting the multitasking environment. Is threads
>significantly better in this respect?

This is what we have been studying for the last year while developing our concurrent language. The speedup of your program depends on how much time is wasted. On the Multimax, there are two ways that time can be wasted:

1. Waiting for references from main memory
2. Waiting on locks

You can minimize the first by making your threads refer only to very small segments of memory. The second can be minimized by careful program design. I have not measured how long it takes to start a thread in the supplied package, but in our language it takes about 200 microseconds. If a thread communicates with another thread no more than once a millisecond, your program will show a nearly linear speedup.

R. Anand
Internet: anand@amax.npac.syr.edu
Bitnet: ranand@sunrise
jeff@Alliant.COM (Jeff Collins) (11/15/88)
In article <832@cmx.npac.syr.edu> anand@amax.npac.syr.edu (Anand Rangachari) writes:
> This brings up another question about Umax. I know that on a single
>processor computer running unix, the cpu is interrupted at regular
>intervals by the timer to allow time slice scheduling. The question then
>is: Is each processor interrupted every 200 msec for scheduling or is
>it just one processor which is interrupted. Thus on our 18 processor system
>if the latter were true, each processor will be interrupted only every
>3.6 seconds.

Only one processor is interrupted by the system clock. The system clock generates what is referred to as an arbitrated interrupt. Arbitrated interrupts are handled by the "most eligible" processor. This is determined by the hardware based on the number of other interrupts being handled and some load-leveling magic (I never did understand all of the variables used in this arbitration). This load leveling ensures that one CPU is not bombarded by interrupts. So, on average, as you said, each CPU only handles one clock interrupt every 3.6 seconds.
corbin@pinocchio.Encore.COM (Steve Corbin) (11/18/88)
In article <2623@alliant.Alliant.COM> jeff@alliant.Alliant.COM (Jeff Collins) writes:
>In article <832@cmx.npac.syr.edu> anand@amax.npac.syr.edu (Anand Rangachari) writes:
>> This brings up another question about Umax. I know that on a single
>>processor computer running unix, the cpu is interrupted at regular
>>intervals by the timer to allow time slice scheduling. The question then
>>is: Is each processor interrupted every 200 msec for scheduling or is
>>it just one processor which is interrupted. Thus on our 18 processor system
>>if the latter were true, each processor will be interrupted only every
>>3.6 seconds.
>
> Only one processor is interrupted by the system clock. The
>system clock generates what is referred to as an arbitrated interrupt.
>Arbitrated interrupts are handled by the "most eligible" processor.
>This is determined by the hardware based on the number of other
>interrupts being handled and some load-leveling magic. This load
>leveling ensures that one CPU is not bombarded by interrupts. So, on
>average, as you said - each CPU only handles one clock interrupt every
>3.6 seconds.

The Multimax has both a system clock and a per-processor timer. The system clock is used for maintaining time of day, handling timed queues, etc. The system clock interrupt occurs at the HZ rate set in the boot parameter file. These interrupts are "pooled" or "arbitrated" as described above, so that all CPUs in the system share the processing burden.

The per-processor timers are used for pre-emptive time slicing. If a process does not relinquish the CPU within its time-slice limit, a TSE interrupt will be taken, forcing a reschedule of that process. The current limit on time slices is 200 ms. Thus a CPU can receive a TSE interrupt every 200 ms (CPU bound) or never get one at all (either I/O bound or making lots of system calls). It is application-code dependent.
Stephen Corbin
UUCP: {bu-cs,decvax,necntc,talcott}!encore!corbin
Internet: corbin@multimax.ARPA