[net.unix-wizards] Fast File System throughput

speck@cit-vax.ARPA (07/25/84)

From:  Don Speck <speck@cit-vax.ARPA>

    Mkfs (and hence newfs) sets the rotational delay parameter of a new
filesystem to 4ms.  One presumes that this value was chosen to give the
maximum transfer rate from the disk.  It appears to be the same value
used to generate the results in the paper "A Fast File System for Unix",
where they got a throughput of 116 4Kbyte blocks per second from a disk
with 16Kbytes per track and 16.67ms per revolution (hence 4.17ms between
blocks).
    But I've been unable to duplicate their results.  On our CDC 9775
Winchester, I can't get more than about 40 4Kbyte blocks per second
with the default parameters.  A 9775 spins at the same rate and has
the same number of blocks per track as the disk in the paper.
    I made a filesystem on an "a" partition with "newfs -b 4096 -f 1024",
created a file with 6400K-bytes of garbage, and read it with a trivial
program that does read() in a tight loop and nothing else, with read
sizes of 2^n (11 <= n <= 16), and got the same results for each n tried:
a throughput of only about 40 4Kbyte blocks per second (idle machine).
After I do "tunefs -d 7" on the filesystem and recreate the file, the
throughput almost doubles.  Apparently 4ms is just not enough time for
the machine to get ready to read the next block.  Trying it on an Eagle,
about half of the 4Kbyte transfers take more than 5ms to set up.
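    As a rough check on these numbers (a sketch of my own arithmetic, not
part of the measurements above), they are consistent with the filesystem
skipping whole 4Kbyte block slots to cover the rotational delay, and with
a missed window costing one full extra revolution:

	#include <stdio.h>

	int main(void)
	{
		double rev_ms = 16.67;			/* one revolution */
		int slots = 4;				/* 4Kbyte blocks per 16Kbyte track */
		double slot_ms = rev_ms / slots;	/* ~4.17ms per block */
		double rotdelay[] = { 4.0, 7.0 };	/* -d 4 (default) and -d 7 */
		int i;

		for (i = 0; i < 2; i++) {
			int skip;
			double hit_ms, miss_ms;

			skip = (int)(rotdelay[i] / slot_ms);
			if (skip * slot_ms < rotdelay[i])
				skip++;			/* round up to whole slots */
			hit_ms = (skip + 1) * slot_ms;	/* start-to-start, window made */
			miss_ms = hit_ms + rev_ms;	/* window missed: one extra revolution */
			printf("-d %.0f: %3.0f blocks/sec made, %3.0f missed\n",
			    rotdelay[i], 1000.0 / hit_ms, 1000.0 / miss_ms);
		}
		return 0;
	}

On that reading, the paper's 116 blocks per second sits just under the
120 ceiling for the default delay, the 40 blocks per second above looks
like every window being missed, and 80 is about what one would expect
from "tunefs -d 7" when the windows are made.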
    Is there something wrong with our configuration that makes it take
longer than 4ms to set up a 4K-byte transfer?  The configuration is:
a CDC 9775 mapped as two RM05's, a CDC 9766, and an Eagle all on one
SI 9900 controller on mba0, TU77 tape drive on mba1, one unibus with:
5 Dec DZ's, 3 Ether boards (en0, il0, il1), LH/DH imp interface, two
brand-X LP-11's, and a Versatec;  VAX/780 cpu running 4.2bsd, kernel
options:  COMPAT,INET,PUP,QUOTA, and 16 megabyte data space.  No quotas
on the partitions tested.  Test was done on a quiet Sunday evening when
there was 97% idle time and no other disk I/O aside from /etc/update.
    Does anyone see a reason why our 780 can't get as much filesystem
throughput as the 750 that generated the results in the paper?

					Don Speck

bass@dmsd.UUCP (John Bass) (07/26/84)

For most folks, interleave factors are pure magic, set by trial and error.
Nothing could be farther from the truth:

The interleave factor is a skewing in the physical placement or logical
numbering of sectors to match the cpu service time to the rotation time
between consecutively read sectors. ... hmm ok so what?

cpu service times:

	interrupt latency +
	device driver interrupt processing time +
	device driver transfer start time +
	( possible reschedule and context switch +)
	( cpu time for higher priority kernel processes +)
	( cpu time for higher priority interrupts +)
	read system call time +
	application processing time =

	interlace time in msec.

interlace time is:

	(sectors per track / rotation time in msec) * interlace time in msec =

	interlace time in sectors

interleave factor is:

	interlace time in sectors + 1
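
As a concrete illustration of the arithmetic above (a sketch only; the
sectors-per-track and service-time figures below are made-up example values):

	#include <stdio.h>

	int main(void)
	{
		double rotation_ms = 16.67;	/* 3600 rpm */
		int sectors_per_track = 32;	/* e.g. 512-byte sectors on a 16Kbyte track */
		double service_ms = 1.2;	/* measured total cpu service time, msec */
		double interlace_sectors;
		int interleave;

		/* (sectors per track / rotation time in msec) * interlace time in msec */
		interlace_sectors = (sectors_per_track / rotation_ms) * service_ms;
		/* interleave factor = interlace time in sectors + 1
		 * (fraction truncated, so the next whole sector slot is used) */
		interleave = (int)interlace_sectors + 1;

		printf("interlace time = %.2f sectors, interleave factor = %d\n",
		    interlace_sectors, interleave);
		return 0;
	}

With those example figures the interlace time is about 2.3 sectors, so the
interleave factor comes out to 3.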

Gotchas:

	1) Missing the interlace timing results in one full rotation (plus
	   a little) of lost time. Net throughput approaches 1 sector per
	   rotation depending on the frequency of misses.

	2) Since most disks rotate 60 times per second, the typical clock
	   frequency means that clock interrupts (callout processing and
	   possible wakeups) consume 1-3 sectors of interlace time per
	   revolution.

	   Thus minimizing use of high frequency callouts and the cpu time
	   they consume is mandatory (new device driver programmers seldom
	   worry about this).

	   If the interlace factor is set exactly (not counting clock interrupt
	   times), then one rotation time will be lost per clock interrupt
	   ... if there is little or no callout processing, then increasing the
	   interleave factor by one or two will cover clock interrupts
	   with no reduction in throughput.

	3) Serial receive and transmit interrupts for 9600 baud occur
	   at one msec intervals for DZ and SIO type devices, and for
	   19200 baud occur at 500 usec intervals. Thus a single 9600
	   baud line will, when active, invoke about 18 interrupts per rotation
	   at 3600rpm ... for most 5mb 5-1/4 drives this is one interrupt
	   per sector (512byte) on a track, and for 10mb drives one interrupt
	   per (1k byte) block on a track (see the sketch after this list).

	   To prevent large step reductions in throughput when terminals
	   are active on Programmed I/O lines, the interlace factor
	   must be adjusted to allow some average number of lines to be
	   active.

	   NOTE: Pseudo DMA is almost mandatory to get interrupt service times
	   down to 50-150 usec/char, versus the normal 500-1500 usec/char of
	   generalized C coded service routines.

	4) Serial cpu times for either pseudo or real DMA approaches are in
	   the area of 50-400 usec/char, incurred once per buffer done.
	   Since dma buffer lengths are often 16-32 bytes this is basically
	   one long completion interrupt per tty line during each rotation.
	   The net effect: to prevent large step reductions in disk
	   throughput the interleave factor must also be adjusted to
	   cover the average number of tty lines active.

	5) Input traffic from other computers adds substantial load
	   when serviced in raw/cbreak mode. Every input character
	   requires wakeup processing at interrupt service time.

	6) Readahead has little effect on reducing the interleave factor.
	   The net effect is that it allows the service times for two
	   consecutive sectors to be averaged dynamically, resulting in
	   fewer misses due to infrequent events.

	7) All of this scales in a non-linear fashion depending on cpu speed.
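
To put rough numbers on gotchas 2 and 3 (a sketch only; the per-character
interrupt costs are assumed figures inside the ranges quoted above, and the
clock is taken as roughly one tick per revolution):

	#include <stdio.h>

	int main(void)
	{
		double rev_ms = 16.67;				/* 3600 rpm */
		double char_ms = 10.0 * 1000.0 / 9600.0;	/* ~1.04 ms/char at 9600 baud */
		double clock_per_rev = 60.0 * rev_ms / 1000.0;	/* ~1 clock tick per rev */
		double sector_ms = rev_ms / 17.0;		/* 17 512-byte sectors/track */
		double cost_us[] = { 800.0, 100.0 };		/* C coded PIO vs pseudo DMA */
		double char_per_rev = rev_ms / char_ms;
		int i;

		printf("one active 9600 baud line: %.0f char + %.0f clock ints/rev\n",
		    char_per_rev, clock_per_rev);
		for (i = 0; i < 2; i++) {
			double stolen_ms = (char_per_rev + clock_per_rev) * cost_us[i] / 1000.0;

			printf("at %.0f usec each: %.1f ms stolen = %.1f sectors of interlace\n",
			    cost_us[i], stolen_ms, stolen_ms / sector_ms);
		}
		return 0;
	}

At the programmed-I/O cost a single flat-out line eats most of a revolution,
which is why the interleave factor has to allow for active lines; at the
pseudo-DMA cost the loss is only a sector or two.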


Does all this sound hopeless? ... For single user workstations at most one
line is active ... for larger multiuser systems disk throughput (and the
resulting response time) is generally terrible.

Setting the proper interlace factor requires a combination of measurements
from a good logic analyzer and tradeoff decisions after doing a good 
performance/control flow study.

The only fix is to use disk controllers that can handle multiple outstanding
requests -- few hardware systems handle this.

The above is a general view on interlace factors ... more important to
traditional 512/1kbyte filesystems ... but still a non-trivial problem
for tuning 4.2 filesystems.

I will be giving a talk at the annual UNIOPS Conference in San Francisco
next week (8/2/84) which goes into a lot of detail on filesystem performance
issues ... of which interlace factors are a minor but important part.

John Bass (Systems Performance and Arch Consultant)
{dual,fortune,hpda,idi}!dmsd!bass        408-996-0557