[comp.os.research] Question on thread/process scheduler interaction

fouts@lemming. (Marty Fouts) (10/28/88)

I have seen the following problem occur in a number of lightweight
process implementations and was wondering if anyone had a ready
solution:

N tasks (threads) are started on top of M processes on a
multiprocessor with P processors (N >= M >= P > 1).

The tasks involve synchronization which requires spending some amount
of time in a critical region, usually one protected by a semaphore.

One task enters the critical region in one process.  While there, a
second task running on a different process spinlocks on the semaphore.
The two processes are actually running on different processors.

The process level scheduler preempts the process running the task in
the critical region leaving the second process at the spinlock.  The
worst case is when the process is preempted by a third process
belonging to the same job which also reaches the spinlock.
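
For concreteness, the pattern looks roughly like this (just a sketch, not
code from any of the implementations I have in mind; C atomics stand in
for whatever test-and-set primitive the hardware provides):

	#include <stdatomic.h>

	static atomic_flag region_lock = ATOMIC_FLAG_INIT;

	void enter_region(void)		/* called by each task */
	{
		/* Spin until the lock is free.  If the process holding the
		 * lock has been preempted, every waiter burns its processor
		 * for as long as the holder stays off the CPU. */
		while (atomic_flag_test_and_set(&region_lock))
			;		/* busy-wait */
	}

	void leave_region(void)
	{
		atomic_flag_clear(&region_lock);
	}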

Marty
--
+-+-+-+     I don't know who I am, why should you?     +-+-+-+
   |        fouts@lemming.nas.nasa.gov                    |
   |        ...!ames!orville!fouts                        |
   |        Never attribute to malice what can be         |
+-+-+-+     explained by incompetence.                 +-+-+-+

rhys@uunet.UU.NET (Rhys Francis) (10/30/88)

I am about to do some simulations on solutions to the problem
fouts@lemming (Marty Fouts) describes in article <5296@saturn.ucsc.edu>,
but in a more general context, and would be interested in proposals and
performance results already known.

The more general problem I am investigating is the notion that the O.S.
of a large shared-memory multiprocessor will need to vary the share of
CPU resources allocated to jobs over time.  Hence a parallel program must
be implemented in a fashion which allows it to continue to execute
effectively as processors time out and are re-allocated to other jobs.
This problem has to be addressed if such beasts are going to be used
in general-purpose multiprogrammed and time-sharing environments.

It seems that the O.S. scheduler needs knowledge of the thread-level
scheduling structures, which creates security, programming, and performance
problems if applications wish to exercise control over thread scheduling.
It also seems that interrupt logic and handling needs to intervene at the
task level as well as at the process level.  Finally, if tasks or processes
attempt self-scheduling across loops (perhaps using Fetch-and-Add style
hardware support), time-out interrupts in the body of an iteration can be nasty.
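
To be concrete, the self-scheduling idiom I have in mind looks roughly
like this (just a sketch under my own naming; atomic_fetch_add stands in
for the hardware Fetch-and-Add, and do_iteration() is assumed):

	#include <stdatomic.h>

	#define N_ITERATIONS 10000		/* loop bound (arbitrary) */

	static atomic_int next_iteration;	/* shared counter, starts at 0 */

	extern void do_iteration(int i);	/* the loop body (assumed) */

	void worker(void)			/* executed by every task */
	{
		int i;

		/* Each task claims the next unclaimed iteration with an
		 * atomic fetch-and-add.  A time-out interrupt taken between
		 * claiming an iteration and finishing do_iteration() strands
		 * that iteration until the interrupted process runs again. */
		while ((i = atomic_fetch_add(&next_iteration, 1)) < N_ITERATIONS)
			do_iteration(i);
	}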

Further questions concern appropriate O.S. job scheduling and CPU scheduling
policies as well as methods for real-time acquisition of accurate measures
of the CPU resource being used by a parallel application.

Dr. Rhys Steven Francis,          ACSnet:  CSnet:    rhys@latcs1.oz
Dept. of Computer Science,        ARPA:    rhys%latcs1.oz@uunet.uu.net
LaTrobe University,               JANET:   latcs1.oz!rhys@ukc
Bundoora, Vic 3083,               UUCP:    {enea,hplabs,mcvax,nttlab,ukc,uunet}!
Australia.                                 munnari!latcs1.oz!rhys
ISD +61 3 479-2504 (desk) ISD +61 3 479-2598 (dept)

edler@cmcl2.nyu.edu (Jan Edler) (11/15/88)

This topic is a few weeks old, but some kind of news problem here
delayed my posting this.

In article <5296@saturn.ucsc.edu> fouts@lemming. (Marty Fouts) writes:
>N tasks (threads) are started on top of M processes on a
>multiprocessor with P processors (N >= M >= P > 1).
...
>One task enters the critical region in one process.  While there, a
>second task running on a different process spinlocks on the semaphore.
...
>The process level scheduler preempts the process running the task in
>the critical region leaving the second process at the spinlock.  The
>worst case is when the process is preempted by a third process
>belonging to the same job which also reaches the spinlock.

Our solution to this problem is a mechanism called "temporary nonpreemption".
A process informs the kernel of temporary conditions that make preemption
inadvisable.  The scheduler in the kernel honors this advice, up to a limit
to prevent abuse.  This solves the problem as long as the maximum time in
the critical section is less than the limit the scheduler is willing to honor.

The implementation is very efficient.
Two special communication variables are used.  They are ordinary integers,
residing in the user's address space, but their location is known to the
scheduler.  Both are initialized to 0.  The first is set to a non-zero
value by the user to request temporary non-preemption, and the second is set
to a non-zero value by the scheduler to warn the user that preemption will
soon be required.  We assume the scheduler runs like an interrupt handler.

Here is pseudo-code:

	/* The communication variables; both live in user space and start at 0. */
	int	var1;		/* set nonzero by the user to request nonpreemption */
	int	var2;		/* set nonzero by the scheduler to warn of preemption */

	void tempnopreempt()	/* request temporary nonpreemption */
	{
		var1++;
	}

	int tempokpreempt()	/* relinquish temporary nonpreemption */
	{
		int temp;

		if (--var1 == 0 && (temp = var2) != 0) {
			var2 = 0;
			yield();	/* let the deferred preemption happen now */
			return temp;
		}
		return 0;
	}

Yield() is a new system call that reschedules the processor.
Because var1 is incremented and decremented, these calls nest.
When the scheduler wants to preempt the currently running process,
it executes code like this:

	if (var1 == 0)
		ok to preempt;
	else if (preemption not already pending for this process) {
		var2 = 1;	/* notify user */
		note preemption pending;
	}
	else if (preemption pending for maximum allowable time) {
		var2 = 2;	/* time is up! */
		ok to force preemption;
	}

The pending-preemption status is cleared on actual preemptions and on calls
to yield().  The purpose of the tempokpreempt() return value is to notify
the user (after the fact) whether preemption was requested or forced.
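
For example, a user-level lock might wrap its critical region like this
(just a sketch, not our actual library code; C atomics stand in for our
real test-and-set primitive, and the tempnopreempt()/tempokpreempt()
routines are the ones shown above):

	#include <stdatomic.h>

	extern void tempnopreempt();	/* as defined above */
	extern int tempokpreempt();

	static atomic_flag region_lock = ATOMIC_FLAG_INIT;

	void lock_region(void)
	{
		tempnopreempt();	/* don't preempt us once we hold the lock */
		while (atomic_flag_test_and_set(&region_lock))
			;		/* spin */
	}

	void unlock_region(void)
	{
		atomic_flag_clear(&region_lock);
		tempokpreempt();	/* nonzero return: preemption was
					 * deferred or forced while inside */
	}

Requesting nonpreemption before the spin, rather than after acquiring the
lock, leaves no window in which the lock can be held by a preemptable
process; the cost is that a spinning waiter also holds off preemption, but
only until the scheduler's limit expires.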

Overhead is only a couple of instructions in the common case where no
preemption would have occurred anyway, and only the overhead of yield()
otherwise.  The worst abuse a user can manage is to lengthen a time slice
by a small amount, and this can be compensated for by shortening the next
time slice.

Jan Edler
NYU Ultracomputer Project
edler@nyu.edu
...!cmcl2!edler
(212) 998-3353

fouts@lemming. (Marty Fouts) (11/16/88)

In article <5460@saturn.ucsc.edu> edler@cmcl2.nyu.edu (Jan Edler) writes:

   This topic is a few weeks old, but some kind of news problem here
   delayed my posting this.

   In article <5296@saturn.ucsc.edu> fouts@lemming. (Marty Fouts) writes:
   >N tasks (threads) are started on top of M processes on a
   >multiprocessor with P processors (N >= M >= P > 1).
   ...
   >One task enters the critical region in one process.  While there, a
   >second task running on a different process spinlocks on the semaphore.
   ...
   >The process level scheduler preempts the process running the task in
   >the critical region leaving the second process at the spinlock.  The
   >worst case is when the process is preempted by a third process
   >belonging to the same job which also reaches the spinlock.

   Our solution to this problem is a mechanism called "temporary nonpreemption".
   A process informs the kernel of temporary conditions that make preemption
   inadvisable.  The scheduler in the kernel honors this advice, up to a limit
   to prevent abuse.  This solves the problem as long as the maximum time in
   the critical section is less than the limit the scheduler is willing to honor.

   The implementation is very efficient.

   [More complete description of implementation deleted.]

The system I am investigating has this feature, along with an
additional "advantage":  When a process which is running a task is
about to be preempted by the OS-level scheduler, it saves state so
that the next process of the group to be scheduled can pick up the
state and continue execution.  This is done to avoid fixing the
relationship of a task to a process.

The disadvantage in the current implementation is an additional
"feature":  Each time a process picks up the work of a task which has
been dropped because some other process was preempted by the OS, it
picks the work up at a synchronization point, which may (and generally
does) lead to work being duplicated.  At the language level in this
implementation, the synchronization point corresponds to the entry
point of a subroutine.

For a suitably pathological program (i.e., the one I'm interested in
implementing), the behavior of this system exactly mimics the behavior
of the naive system I described in my original posting:  A process
enters the critical region of a task and informs the operating system
that it should not be preempted.  It then takes longer than the OS is
willing to allow to perform the critical-region work (the pathology),
so the process is forced out anyway.

Oh well, I guess it is time to restructure the problem to require less
work in the critical region.  One good thing is that my measurements
seem to confirm the claim of efficiency.  Even in a very pathological
case this system is much more efficient than the naive approach.

--
+-+-+-+     I don't know who I am, why should you?     +-+-+-+
   |        fouts@lemming.nas.nasa.gov                    |
   |        ...!ames!orville!fouts                        |
   |        Never attribute to malice what can be         |
+-+-+-+     explained by incompetence.                 +-+-+-+