speck@cit-vax.ARPA (07/25/84)
From: Don Speck <speck@cit-vax.ARPA>

Mkfs (and hence newfs) sets the rotational delay parameter of a new filesystem to 4ms. One presumes that this value was chosen to give the maximum transfer rate from the disk. It appears to be the same value used to generate the results in the paper "A Fast File System for Unix", where they got a throughput of 116 4Kbyte blocks per second from a disk with 16Kbytes per track and 16.67ms per revolution (hence 4.17ms between blocks).

But I've been unable to duplicate their results. On our CDC 9775 Winchester, I can't get more than about 40 4Kbyte blocks per second with the default parameters. A 9775 spins at the same rate and has the same number of blocks per track as the disk in the paper.

I made a filesystem on an "a" partition with "newfs -b 4096 -f 1024", created a file with 6400K-bytes of garbage, and read it with a trivial program that does read() in a tight loop and nothing else, with read sizes of 2^n (11 <= n <= 16). I got the same result for each n tried: a throughput of only about 40 4Kbyte blocks per second (idle machine). (A sketch of such a test program appears after this message.) After I do "tunefs -d 7" on the filesystem and recreate the file, the throughput almost doubles. Apparently 4ms is just not enough time for the machine to get ready to read the next block. When I tried it on an Eagle, about half of the 4Kbyte transfers took more than 5ms to set up.

Is there something wrong with our configuration that makes it take longer than 4ms to set up a 4K-byte transfer? The configuration is: a CDC 9775 mapped as two RM05's, a CDC 9766, and an Eagle, all on one SI 9900 controller on mba0; a TU77 tape drive on mba1; one unibus with 5 DEC DZ's, 3 Ether boards (en0, il0, il1), an LH/DH imp interface, two brand-X LP-11's, and a Versatec; a VAX/780 cpu running 4.2bsd with kernel options COMPAT, INET, PUP, QUOTA, and 16 megabyte data space. No quotas on the partitions tested. The test was done on a quiet Sunday evening when there was 97% idle time and no other disk I/O aside from /etc/update.

Does anyone see a reason why our 780 can't get as much filesystem throughput as the 750 that generated the results in the paper?

Don Speck
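[For concreteness, a minimal sketch of the sort of tight read() loop described above. This is not the actual test program; the command-line handling, block-size default, and gettimeofday() timing are assumptions.]

    /* Minimal sequential-read throughput test (sketch, not the original).
     * Usage: readtest file [blocksize]
     * Reads the file in a tight loop and reports 4Kbyte blocks per second.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    int main(int argc, char **argv)
    {
        int fd, n;
        long total = 0;
        int bsize = (argc > 2) ? atoi(argv[2]) : 4096;
        char *buf = malloc(bsize);
        struct timeval t0, t1;
        double secs;

        if (argc < 2) {
            fprintf(stderr, "usage: readtest file [blocksize]\n");
            return 1;
        }
        if ((fd = open(argv[1], O_RDONLY)) < 0) {
            perror("open");
            return 1;
        }
        gettimeofday(&t0, NULL);
        while ((n = read(fd, buf, bsize)) > 0)   /* tight read() loop, nothing else */
            total += n;
        gettimeofday(&t1, NULL);
        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%.1f 4Kbyte blocks/sec\n", (total / 4096.0) / secs);
        return 0;
    }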
bass@dmsd.UUCP (John Bass) (07/26/84)
For most folks the interleave factor is pure magic, set by trial and error. Nothing could be farther from the truth: the interleave factor is a skewing in the physical placement or logical numbering of sectors to match the cpu service time to the rotation time between consecutively read sectors. ... hmm ok so what?

cpu service time:
      interrupt latency
    + device driver interrupt processing time
    + device driver transfer start time
    + (possible reschedule and context switch)
    + (cpu time for higher priority kernel processes)
    + (cpu time for higher priority interrupts)
    + read system call time
    + application processing time
    = interlace time in msec

interlace time in sectors:
    (sectors per track / rotation time in msec) * interlace time in msec

interleave factor:
    interlace time in sectors + 1

(A worked example in C follows the numbered gotchas below.)

Gotchas:

1) Missing the interlace timing results in one full rotation (plus a little) of lost time. Net throughput approaches 1 sector per rotation, depending on the frequency of misses.

2) Since most disks rotate 60 times per second, the typical clock frequency means that clock interrupts (callout processing and possible wakeups) consume 1-3 sectors of interlace time per revolution. Thus minimizing the use of high frequency callouts and the cpu time they consume is mandatory (new device driver programmers seldom worry about this). If the interlace factor is set exactly (not counting clock interrupt times), then one rotation time will be lost per clock interrupt; if there is little or no callout processing, then increasing the interleave factor by one or two will cover clock interrupts with no reduction in throughput.

3) Serial receive and transmit interrupts for 9600 baud occur at one msec intervals for DZ and SIO type devices, and for 19200 baud at 500 usec intervals. Thus a single 9600 baud line will, when active, invoke about 18 interrupts per rotation at 3600rpm; for most 5mb 5-1/4 drives that is one interrupt per sector (512 bytes) on a track, and for 10mb drives one interrupt per (1k byte) block on a track. To prevent large step reductions in throughput when terminals are active on programmed I/O lines, the interlace factor must be adjusted to allow some average number of lines to be active. NOTE: pseudo-DMA is almost mandatory to get interrupt service times down to 50-150usec/char, versus the normal 500-1500usec/char of generalized C-coded service routines.

4) Serial cpu times for either pseudo- or real-DMA approaches are in the area of 50-400usec/char, incurred once per buffer completion. Since dma buffer lengths are often 16-32 bytes, this is basically one long completion interrupt per tty line during each rotation. The net effect: to prevent large step reductions in disk throughput, the interleave factor must also be adjusted to cover the average number of tty lines active.

5) Input traffic from other computers adds substantial load when serviced in raw/cbreak mode. Every input character requires wakeup processing at interrupt service time.

6) Readahead has little effect on reducing the interleave factor. The net effect is that it allows the service times for two consecutive sectors to be averaged dynamically, resulting in fewer misses due to infrequent events.

7) All of this scales in a non-linear fashion depending on cpu speed.

All this sounds hopeless? ... For single-user workstations at most one line is active ... for larger multiuser systems disk throughput is generally terrible (as are the resulting response times).
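[A small C sketch of the interleave-factor arithmetic above. The drive geometry and the 5ms service time are made-up placeholders for measured values, and rounding the fractional sector count up before adding 1 is one reading of the formula, not something stated in the article.]

    /* Interleave-factor arithmetic from the formulas above.
     * The service time is whatever your measurements come up with;
     * the figures here are placeholders.
     */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double sectors_per_track = 16.0;    /* assumed drive geometry */
        double rotation_msec     = 16.67;   /* 3600 rpm */
        double service_msec      = 5.0;     /* assumed measured cpu service time */

        /* interlace time in sectors =
         *   (sectors per track / rotation time in msec) * interlace time in msec */
        double interlace_sectors =
            (sectors_per_track / rotation_msec) * service_msec;

        /* interleave factor = interlace time in sectors + 1;
         * rounding the fractional sector count up is an interpretation here */
        int interleave = (int)ceil(interlace_sectors) + 1;

        printf("interlace time = %.2f sectors, interleave factor = %d\n",
               interlace_sectors, interleave);
        return 0;
    }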
Setting the proper interlace factor requires a combination of measurements from a good logic analyzer and tradeoff decisions after doing a good performance/control flow study. The only real fix is to use disk controllers that can handle multiple outstanding requests -- few hardware systems handle this.

The above is a general view on interlace factors ... more important to traditional 512/1kbyte filesystems ... but still a non-trivial problem for tuning 4.2 filesystems. I will be giving a talk at the annual UNIOPS Conference in San Francisco next week (8/2/84) which goes into a lot of detail on filesystem performance issues ... of which interlace factors are a minor but important part.

John Bass (Systems Performance and Arch Consultant)
{dual,fortune,hpda,idi}!dmsd!bass    408-996-0557