[comp.arch] MP Locking Fine Grain or Course -> Coarse

gdb@hare.udev.cdc.com (Jerry Branham) (05/30/90)

>  Does anyone have a handle on the level of locking that is necessary in
> the kernel in order to support mp hardware? In other words, at what point
> does it stop paying off to go from a single threaded kernel to one where
> every shared structure is controlled by a spin lock?

 Many programmers are addressing the mp problem with a very large number of
processors (~250). I meant my question for around 10 or fewer.

 Programmers seem to agree that the master/slave approach was not useful for
a production system.  But the degree of fine grain locking necessary to avoid
contention was not spelled out.

 In some cases, atomic operations were useful.  You don't want to put
spin locks on all structures.  Some structures need more than one lock; others
can share a single lock per group.
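
 As an illustration of "one lock per group", here is a minimal sketch of lock
striping. It is not taken from any of the replies: the class name
StripedCounters, the stripe count, and the hash-based grouping rule are all
assumptions made for the example, and Python threads only stand in for CPUs.

```python
import threading


class StripedCounters:
    """Counters guarded by one lock per group (stripe), not one lock per counter."""

    def __init__(self, n_stripes=4):
        # One lock and one private table per stripe (hypothetical layout).
        self.locks = [threading.Lock() for _ in range(n_stripes)]
        self.tables = [{} for _ in range(n_stripes)]

    def stripe_of(self, key):
        # Assumed grouping rule: hash the key onto a stripe.
        return hash(key) % len(self.locks)

    def increment(self, key):
        # Only updates that land in the same stripe contend with each other.
        i = self.stripe_of(key)
        with self.locks[i]:
            self.tables[i][key] = self.tables[i].get(key, 0) + 1
            return self.tables[i][key]

    def get(self, key):
        i = self.stripe_of(key)
        with self.locks[i]:
            return self.tables[i].get(key, 0)
```

 With one stripe this degenerates to a single table lock; with as many stripes
as entries it approaches per-entry locking. Where to sit between the two is
what the measurements are supposed to tell you.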

 Good advice was - "Think, measure, read, measure, think, then code".

 Many comments were based on intuition and talked about probable collisions
and about interlocks on table entries rather than on the entire table. In
general, there were no specific numbers about performance improvements on
specific benchmarks when a kernel was more finely interlocked. Many studies
were done on wait times etc., but they did not relate these to total system
throughput. For example, if a number of CPUs have to wait on a structure and
the disk is completely saturated anyway, there is no loss in the time for the
job mix as long as the CPUs are available to CPU-saturated jobs (monitor
interlocked tasks that can sleep).
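
 The disk-saturation argument can be put as a back-of-the-envelope model. The
sketch below assumes a crude bottleneck model in which the slower of the CPU
complex and a single disk determines elapsed time; the function name
job_mix_time and all the numbers are made up for illustration, not measured.

```python
def job_mix_time(cpu_work, n_cpus, lock_wait, disk_work):
    """Elapsed seconds for a job mix under an assumed bottleneck model.

    cpu_work  : total CPU seconds the mix needs
    n_cpus    : number of processors
    lock_wait : total CPU seconds lost waiting on locks
    disk_work : seconds the (single) disk must be busy
    """
    # CPU side: useful work plus lock waiting, spread evenly over the CPUs.
    cpu_time = (cpu_work + lock_wait) / n_cpus
    # Whichever resource is saturated sets the elapsed time.
    return max(cpu_time, disk_work)


# Disk-bound mix: doubling the CPU cost through lock waits changes nothing.
disk_bound_no_wait = job_mix_time(40, 4, 0, 30)
disk_bound_waiting = job_mix_time(40, 4, 40, 30)

# CPU-bound mix: the same lock waits now show up in full.
cpu_bound_no_wait = job_mix_time(40, 4, 0, 5)
cpu_bound_waiting = job_mix_time(40, 4, 40, 5)
```

 Under these assumptions the disk-bound mix takes 30 seconds either way, while
the CPU-bound mix goes from 10 to 20 seconds. This is why wait-time figures
alone say little about throughput.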

>
> 1) Based on your answer was such a low level of granularity found to be
> necessary from benchmark experience or from programmer experience and
> knowledge of the code?

Both.  Lock contention studies were done to see which large grain locks
needed to be refined.
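
 A lock contention study of this kind can be sketched as a lock wrapper that
counts how often an acquire had to wait. This is only an illustration in user
space: CountingLock and its fields are hypothetical names, and a real kernel
would instrument its own spin lock primitives instead.

```python
import threading


class CountingLock:
    """A lock that records total and contended acquisitions."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()
        self._stats = threading.Lock()  # protects the two counters
        self.acquires = 0
        self.contended = 0

    def acquire(self):
        # Try without blocking first; a failure means someone else holds it.
        if not self._lock.acquire(blocking=False):
            with self._stats:
                self.contended += 1
            self._lock.acquire()
        with self._stats:
            self.acquires += 1

    def release(self):
        self._lock.release()

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.release()

    def contention_ratio(self):
        # Fraction of acquisitions that had to wait; 0.0 if never acquired.
        with self._stats:
            return self.contended / self.acquires if self.acquires else 0.0
```

 Locks with a high contention ratio under a representative load would be the
candidates for splitting; locks that are rarely contended can stay coarse.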

> 2) In order to demonstrate the advantage of low level locking are there
> any benchmarks or instructions to produce benchmarks that will make a
> high level interlocked kernel perform badly while a lower level of
> granularity kernel will run faster - cpu usage and/or real time?

 Most answers suggested parallel makes. Real parallel programs were NOT used.
The dining philosophers problem in Ada (I am not familiar with this one) was
recommended, as was terminal I/O if you are running more than one line at
high speed. Certain special benchmarks were run to test specific parts of
the kernel.

 One person said that "People from Encore may have published on this subject."

 Kernel profiling should use a microsecond clock that cannot be excluded
even by spl6() or spl7() calls.

 Thanks for the comments and of course, coarse is spelled coArse (sorry about
that).


(I speak for me only, etc.)                     Jerry Branham
                                                (612) 482-3853
                                                e-mail gdb@kronos.udev.cdc.com

aglew@dwarfs.csg.uiuc.edu (Andy Glew) (05/31/90)

>Kernel profiling should use a microsecond clock and cannot be
>excluded even by spl6() or spl7() calls.

But if the timer sampling tick is excluded by some spl level, you can
still get useful information: if you are using flat profiling (instead
of per-procedure gprof style profiling) the blocked timer ticks
accumulate at the splx() or spl0() that unblocks the high spl.  So,
while you don't get too much detail, you can at least tell which (end
of a) critical section took up the time.
    (Of course, if you unblock only on return to the user you lose.)
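
The effect can be shown with a toy simulation. This is a sketch under stated
assumptions, not real kernel code: the trace format, the labels, and the
function flat_profile are invented for the example, and it assumes every
blocked tick is remembered and delivered at the unblock (real hardware may
latch only one pending tick).

```python
from collections import Counter


def flat_profile(trace, tick_every):
    """Simulate flat PC-sampling over a trace of (pc_label, op) steps.

    Each step takes one time unit. op is "raise" (block the timer),
    "lower" (an splx()/spl0() style unblock), or None.
    """
    hist = Counter()
    masked = False
    pending = 0
    for t, (pc, op) in enumerate(trace, start=1):
        if op == "raise":
            masked = True
        elif op == "lower":
            masked = False
            # Ticks blocked during the critical section are charged here.
            hist[pc] += pending
            pending = 0
        if t % tick_every == 0:
            if masked:
                pending += 1   # tick arrives but cannot be taken yet
            else:
                hist[pc] += 1  # normal sample at the current pc
    return hist


trace = (
    [("main", None)] * 4
    + [("crit_enter", "raise")]
    + [("crit_body", None)] * 6
    + [("crit_exit", "lower")]
    + [("main", None)] * 4
)
profile = flat_profile(trace, tick_every=2)
```

In this run no samples land on crit_body; the time spent there shows up on
crit_exit, so the histogram still points at the end of the critical section
that consumed the ticks.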

--
Andy Glew, aglew@uiuc.edu