[comp.unix.wizards] bflush algorithm

lars@myab.UUCP (lars) (11/07/86)

In article <244@miduet.gec-mi-at.co.uk> jgh@gec-mi-at.co.uk (Jeremy Harris) writes:
>After bflush finds a delayed-write block and asynch-writes it, it scans again
>from the start of the freelist
>
>I assume this is for safety in case some other process inserts a delayed-write
>block on the freelist, but who will do this?
...
>o	The  disk driver  strategy  routine called by bwrite?   I see no
>	reason for it to manipulate the freelist.
...
>Jeremy Harris	jgh@gec-mi-at.co.uk	...!mcvax!ukc!hrc63!miduet!jgh

In fact, strategy usually manipulates the free list, or at least
the free list pointers of its bp. These pointers are used for
maintaining an internal (to strategy) sorted list of disk operations.

This implies that bp->av_forw is not valid after a call to
strategy with bp. This is easily solved with an extra variable
which "saves" the forward link before the call to strategy.

We implemented this scheme, plus an extra test:
after the whole free list had been processed, the whole procedure
was restarted if and only if at least one buffer really was flushed.
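The scheme above can be sketched in user-space C (the names bflush,
strategy, B_DELWRI and av_forw follow the traditional buffer-cache
conventions; the unlinking and the link-clobbering driver stub are my
assumptions about what the real kernel code does, not the actual code):

```c
#include <assert.h>
#include <stddef.h>

#define B_DELWRI 0x1            /* buffer holds a delayed write */

struct buf {
    int         b_flags;
    struct buf *av_forw;        /* free-list forward link */
};

static int writes;              /* counts simulated async writes */

/* Stand-in for the driver strategy routine.  Real drivers may reuse
 * av_forw for their own sorted request queue, so we deliberately
 * clobber it here to show why the saved link is needed. */
static void strategy(struct buf *bp)
{
    bp->b_flags &= ~B_DELWRI;
    bp->av_forw = bp;           /* simulate link reuse by the driver */
    writes++;
}

/* Flush loop in the spirit described above: save the forward link
 * BEFORE calling strategy, unlink the buffer from the free list (as
 * bwrite would via notavail), and restart the scan only if at least
 * one buffer was really flushed. */
static void bflush(struct buf **head)
{
    int flushed;

    do {
        struct buf **pp = head;
        struct buf *bp;

        flushed = 0;
        while ((bp = *pp) != NULL) {
            struct buf *next = bp->av_forw;   /* save before strategy */
            if (bp->b_flags & B_DELWRI) {
                *pp = next;                   /* unlink from free list */
                strategy(bp);                 /* may clobber bp->av_forw */
                flushed = 1;
            } else {
                pp = &bp->av_forw;
            }
        }
    } while (flushed);          /* at most one extra verifying pass */
}
```

With no flushable buffers left, the do-while costs exactly one more pass,
instead of one full rescan per buffer written.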

System performance on a sync was much improved.

By the way, we use 2000 buffers (UNIX System V.2) with 8 Mbytes of RAM.
What is the optimal number of buffers relative to the amount of memory?
-- 
______________________________________________________
    Lars Pensjo
    {decvax,philabs}!mcvax!enea!chalmers!myab!lars

jgh@gec-mi-at.co.uk (Jeremy Harris) (11/13/86)

[2nd attempt at posting; apologies if anyone's seen it before.]

After bflush finds a delayed-write block and asynch-writes it, it scans again
from the start of the freelist. Code to do this is in AT&T SysV.2 (Vax),
Uniplus SysV.2 (m68k) and BSD 4.2 (Vax).

I assume this is for safety in case some other process inserts a delayed-write
block on the freelist, but who will do this?
o	Not any other real process since the scheduler won't preempt the
	kernel-mode  process  doing the  bflush  unless it goes to sleep
	(which it doesn't).
o	The  disk driver  strategy  routine called by bwrite?   I see no
	reason for it to manipulate the freelist.
o	A  disk  interrupt?   The only time  iodone puts anything on the
	freelist is if the  block was  ASYNC.  Which means that it is no
	longer DELWRI.
That only leaves comms protocols which write directly to disk from the kernel.
Boy, do you have a weird system :-)

So what am I missing?  I'd hate to think that this is one of those simple but
inefficient  algorithms  left over from  when memory was expensive and  buffer
caches  were small.  When playing with a memory-rich system with  4096 buffers
I have  seen a  sync(1)  result in  830  blocks  being written  for a  cost of
2.8 MILLION  buffer headers scanned.  Looks quadratic to me.....  And the hack
to change the behaviour is all of two lines.
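A quick user-space model makes the cost figure above plausible (the
tail-clustering assumption and all names here are mine, not from the
article): with nbuf headers on the free list and the ndelwri dirty
blocks sitting at the tail, where LRU aging tends to leave them,
restarting from the head after every write makes the work roughly the
product of the two:

```c
#include <assert.h>

/* Model of the rescan-from-the-head cost: clean headers precede a run
 * of ndelwri delayed-write blocks at the tail of the free list.  Each
 * flushed buffer is taken off the list, but the scan restarts at the
 * head, so the clean prefix is walked again every time. */
static long restart_cost(int nbuf, int ndelwri)
{
    long scanned = 0;
    int clean = nbuf - ndelwri;    /* clean headers ahead of the run */

    while (ndelwri > 0) {
        scanned += clean + 1;      /* rescan from head to first DELWRI */
        ndelwri--;                 /* written and removed from the list */
    }
    scanned += clean;              /* final pass finds nothing to flush */
    return scanned;
}
```

restart_cost(4096, 830) comes to about 2.7 million headers scanned,
which is in the same ballpark as the 2.8 million observed; continuing
from a saved link instead would cost a single pass of about 4096.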

Most of the words above are trademarks, or otherwise owned by, somebody,
somewhere. I don't speak for any of them.
Jeremy Harris	jgh@gec-mi-at.co.uk	...!mcvax!ukc!hrc63!miduet!jgh

paul@unisoft.UUCP (Paul Campbell) (11/18/86)

In article <245@miduet.gec-mi-at.co.uk> jgh@gec-mi-at.co.uk (Jeremy Harris) writes:
>
>After bflush finds a delayed-write block and asynch-writes it, it scans again
>from the start of the freelist. 
>
>I assume this is for safety in case some other process inserts a delayed-write
>block on the freelist, but who will do this?
>Jeremy Harris	jgh@gec-mi-at.co.uk	...!mcvax!ukc!hrc63!miduet!jgh

You are correct .... this is really inefficient and seems to be quadratic in the
number of buffers. Unfortunately, your fix may NOT fix the problem; it depends on
the drivers in your system. You see, it is quite permissible for a driver's
strategy routine to sleep, thus causing a process switch. Granted, most
don't, and in fact only one that I have ever seen did (not written here).

	A much better solution is a two-stage buffer search: create a local
array of buffer pointers (say 40 odd). Search through the free list for
delayed-write candidates and put them in the array until either you reach the
end of the list or the array is full.  If the array is empty then you are done;
otherwise go back through the array, re-examining each buffer to see if it still
needs to be written (if not, ignore it), and write it if required. Then go back
through the freelist and fill another array. This way, instead of making
one pass through the free list for each one of N delayed-write buffers (all done
at high spl, and chances are these are at the end of the freelist), you make
N/(40 odd) passes. I guess you choose the array size from an estimate of the
percentage of delayed-write buffers in your cache and the cache size.
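The two-stage scheme can be sketched in user-space C (NBATCH, the stub
bawrite and the struct layout are illustrative assumptions, not the
article's actual code):

```c
#include <assert.h>
#include <stddef.h>

#define B_DELWRI 0x1            /* buffer holds a delayed write */
#define NBATCH   40             /* "40 odd" slots, per the article */

struct buf {
    int         b_flags;
    struct buf *av_forw;        /* free-list forward link */
};

static int writes;              /* counts simulated async writes */

static void bawrite(struct buf *bp)     /* stand-in for an async write */
{
    bp->b_flags &= ~B_DELWRI;
    writes++;
}

static void bflush(struct buf *freelist)
{
    struct buf *batch[NBATCH];
    struct buf *bp;
    int n, i;

    do {
        /* Stage 1: one pass over the free list collecting candidates. */
        n = 0;
        for (bp = freelist; bp != NULL && n < NBATCH; bp = bp->av_forw)
            if (bp->b_flags & B_DELWRI)
                batch[n++] = bp;

        /* Stage 2: re-examine each candidate.  In the kernel a driver
         * may have slept meanwhile and the buffer may already have been
         * written or reused, so the flag must be tested again. */
        for (i = 0; i < n; i++)
            if (batch[i]->b_flags & B_DELWRI)
                bawrite(batch[i]);
    } while (n > 0);            /* an empty batch means we are done */
}
```

N dirty buffers now cost about N/NBATCH passes over the list instead
of N, at the price of NBATCH pointers of local storage.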

	The later V.2/V.3 kernels don't suffer from this problem as badly,
thanks to the process that ages buffers and trickles them out over time
rather than writing them all at once during a sync().

	By the way, the original code is quite valid if one assumes that
your system only has a small number of buffers (say ~20).


		Paul Campbell
		..!ucbvax!unisoft!paul