lars@myab.UUCP (lars) (11/07/86)
In article <244@miduet.gec-mi-at.co.uk> jgh@gec-mi-at.co.uk (Jeremy Harris) writes:
>After bflush finds a delayed-write block and asynch-writes it, it scans again
>from the start of the freelist
>
>I assume this is for safety in case some other process inserts a delayed-write
>block on the freelist, but who will do this?
	...
>o The disk driver strategy routine called by bwrite?  I see no
>  reason for it to manipulate the freelist.
	...
>Jeremy Harris	jgh@gec-mi-at.co.uk	...!mcvax!ukc!hrc63!miduet!jgh

In fact, strategy usually manipulates the free list, or at least the
free-list pointers of its bp.  These pointers are used to maintain an
internal (to strategy) sorted list of disk operations.  This implies
that bp->av_forw is not valid after a call to strategy with bp.  The
problem is easily solved with an extra variable which saves the
forward link before the call to strategy.

We implemented this scheme, plus an extra test: after the whole free
list had been taken care of, the whole procedure was restarted if and
only if some buffer really was flushed.  System performance on a sync
was much improved.

By the way, we use 2000 buffers (UNIX System V.2) with 8 Mbytes of
RAM.  What is the optimal number of buffers relative to the size of
memory?
-- 
______________________________________________________
Lars Pensjo
{decvax,philabs}!mcvax!enea!chalmers!myab!lars
jgh@gec-mi-at.co.uk (Jeremy Harris) (11/13/86)
[2nd attempt at posting; apologies if anyone's seen it before.]

After bflush finds a delayed-write block and asynch-writes it, it
scans again from the start of the freelist.  Code to do this is in
AT&T SysV.2 (Vax), Uniplus SysV.2 (m68k) and BSD 4.2 (Vax).

I assume this is for safety in case some other process inserts a
delayed-write block on the freelist, but who will do this?

o  Not any other real process, since the scheduler won't preempt the
   kernel-mode process doing the bflush unless it goes to sleep
   (which it doesn't).
o  The disk driver strategy routine called by bwrite?  I see no
   reason for it to manipulate the freelist.
o  A disk interrupt?  The only time iodone puts anything on the
   freelist is if the block was ASYNC, which means that it is no
   longer DELWRI.

That only leaves comms protocols which write direct to disk from the
kernel.  Boy, do you have a weird system :-)

So what am I missing?  I'd hate to think that this is one of those
simple but inefficient algorithms left over from when memory was
expensive and buffer caches were small.  When playing with a
memory-rich system with 4096 buffers I have seen a sync(1) result in
830 blocks being written for a cost of 2.8 MILLION buffer headers
scanned.  Looks quadratic to me.....  And the hack to change the
behaviour is all of two lines.

Most of the words above are trademarks, or otherwise owned by,
somebody, somewhere.  I don't speak for any of them.

Jeremy Harris	jgh@gec-mi-at.co.uk	...!mcvax!ukc!hrc63!miduet!jgh
paul@unisoft.UUCP (Paul Campbell) (11/18/86)
In article <245@miduet.gec-mi-at.co.uk> jgh@gec-mi-at.co.uk (Jeremy Harris) writes:
>
>After bflush finds a delayed-write block and asynch-writes it, it scans again
>from the start of the freelist.
>
>I assume this is for safety in case some other process inserts a delayed-write
>block on the freelist, but who will do this?
>Jeremy Harris	jgh@gec-mi-at.co.uk	...!mcvax!ukc!hrc63!miduet!jgh

You are correct ... this is really inefficient and seems to be
quadratic in the number of buffers.  Unfortunately, your fix may NOT
fix the problem.  It depends on the drivers in your system: it is
quite allowable for a driver's strategy routine to sleep, thus causing
a process switch.  Granted, most don't, and in fact only one that I
have ever seen did (not written here).

A much better solution is a two-stage buffer search.  Create a local
array of buffer pointers (say 40 odd).  Search through the free list
for delayed-write candidates and put them in the array until either
you reach the end of the list or the array is full.  If the array is
empty you are done; otherwise go back through the array, re-examining
each buffer to see if it still needs to be written (if not, ignore
it), and write it if required.  Then go back through the freelist and
fill another array.  This way, instead of making one pass through the
free list for each one of N delayed-write buffers (all done at high
spl, and chances are the buffers are at the end of the freelist), you
make N/(40 odd) passes.  I guess you choose the array size from an
estimate of the percentage of delayed-write buffers in your cache and
the cache size.

The later V.2/V.3 kernels don't suffer from this problem as badly,
thanks to the process that ages buffers and trickles them out over
time rather than writing them all at once during a sync().  By the
way, the original code is quite valid if one assumes that the system
has only a small number of buffers (say ~20).

Paul Campbell
..!ucbvax!unisoft!paul