[comp.unix.questions] Berkeley file system tuning

banderso@sagpd1.UUCP (Bruce Anderson) (01/11/89)

We are in the process of switching over from an HP9000/540, which
uses a "Structured Directory Format" (proprietary HP) file system
where everything (files, inodes and swap space) is dynamically
allocated, to an HP9000/835, which uses a Berkeley file system with
everything statically allocated at system configuration time.  I am
wondering if anyone has any data on the effects of some of the
possible configuration tradeoffs.

First, I gather that using multiple disk sections is supposed to
increase speed, but at first glance it appears to do this by giving
you 5 or 6 places to run out of disk space rather than just one.  If
you have a file system that is just a little bigger than one of the
sections, you can waste an incredible amount of space (for example,
if you have 31 MB of data you may have to use a 110 MB section
rather than a 30 MB one).  My first question is: how much speed do
you gain by breaking the disk up into small chunks rather than using
it as just one (or possibly two) very large sections?

My second question is: how much effect does changing the block and
fragment size have?  The manual says that using an 8K block and
fragment size speeds up the file system but wastes space.  Does
anyone have a quantitative feel for how large the tradeoff is?

When allocating inodes, what kind of ratio of disk space to inodes
do people use? The default on the system is an inode for every 2KB
of disk space in a file system but this seemed like an awfully high
number of inodes. Is it?

This is probably HP specific, but if you define multiple swap
sections, does the system fill up the first before starting on the
secondary ones, or does it use them all in a balanced manner?  If
the first, then obviously the primary swap space should be on the
fastest drive, but otherwise it doesn't matter.

Any information would be appreciated. Post or mail as you wish.

Bruce Anderson  -  Scientific Atlanta, Government Products Division
...!sagpd1!banderso

dhesi@bsu-cs.UUCP (Rahul Dhesi) (01/12/89)

In article <310@sagpd1.UUCP> banderso@sagpd1.UUCP (Bruce Anderson) writes:
>First, I gather that using multiple disk sections is supposed to
>increase speed...

I've heard this said, but I don't see why breaking up a disk into
pieces will speed up access.  The only exception I can see is the rare
case when you have a big partition containing files that are almost
never accessed.  If this partition is at the end of the disk the disk
head almost never has to travel that far.

The main (4.3BSD-specific) reasons for having multiple disk partitions
are:  (a) dump and restore work on entire partitions, so the smaller a
partition the more flexible your backup procedures can be;
(b) filesystem parameters can be individually adjusted for partitions
in case you want to use different block sizes etc.; (c) to do swapping
on a disk you have to have a partition dedicated to that;  and (d) you
can protect the rest of the disk from filling up by giving a directory
like /usr/tmp (or /a/crash :-) its own filesystem.

I think the *most* popular reason for having disks partitioned a
certain way is that that's how the operating system was configured
when you got it, and it's too much trouble to change.  That is
certainly why we have our disks partitioned the way they are.

breck@aimt.UU.NET (Robert Breckinridge Beatie) (01/12/89)

In article <5324@bsu-cs.UUCP>, dhesi@bsu-cs.UUCP (Rahul Dhesi) writes:
> In article <310@sagpd1.UUCP> banderso@sagpd1.UUCP (Bruce Anderson) writes:
> >First, I gather that using multiple disk sections is supposed to
> >increase speed...
> 
> I've heard this said, but I don't see why breaking up a disk into
> pieces will speed up access.  The only exception I can see is the rare
> case when you have a big partition containing files that are almost
> never accessed.  If this partition is at the end of the disk the disk
> head almost never has to travel that far.

I interpreted his question as referring to cylinder groups in the BSD Fast
File System.  There are two (performance related) reasons that I can
think of for Cylinder Groups.  How effective Cylinder Groups are, I cannot
say.  The BSD file system certainly seems faster than the old style file
system, but how much of that is due to the 8K (fs_bsize actually) block
size and how much is due to improved locality of reference resulting from
Cylinder Groups?

First: I think the BSD file system attempts to keep inodes that are all
referenced by the same directory in the same cylinder group.  This way
when you stat(2) all the files in a directory the inodes that the system
will have to read will probably be (somewhat) closer together.

Second: If the file system manages to keep all the blocks for a file in
the same cylinder group as that file's inode, then the seek distance from
inode to file-data will (typically) be smaller than in the old-style
file system.  I'm not sure how big a win this is, since under the BSD
file system the disk heads will have to seek across cylinder groups
all the time.
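
For the first point, here is a rough sketch (just an illustration,
not from the FFS paper or the kernel) that prints the inode number
of every entry in a directory.  On an FFS partition, files created
together in one directory tend to get numerically close inode
numbers, which is consistent with their sharing a cylinder group.
Error handling is minimal and the path buffer size is arbitrary.

    /* dir-inodes.c -- print the inode number of each entry in a
     * directory.  On a BSD FFS partition, entries created in the
     * same directory usually share a cylinder group, so the numbers
     * tend to cluster.  Quick illustration only.
     */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <dirent.h>

    int main(int argc, char **argv)
    {
        char path[1024];
        struct dirent *de;
        struct stat sb;
        DIR *dp;
        char *dir = (argc > 1) ? argv[1] : ".";

        if ((dp = opendir(dir)) == NULL) {
            perror(dir);
            return 1;
        }
        while ((de = readdir(dp)) != NULL) {
            sprintf(path, "%s/%s", dir, de->d_name);
            if (stat(path, &sb) == 0)
                printf("%10lu  %s\n",
                       (unsigned long) sb.st_ino, de->d_name);
        }
        closedir(dp);
        return 0;
    }

Whether the clustering actually buys much is, of course, the open
question above.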


-- 
Breck Beatie	    				(408)748-8649
{uunet,ames!coherent}!aimt!breck  OR  breck@aimt.uu.net
"Sloppy as hell Little Father.  You've embarassed me no end."

larry@macom1.UUCP (Larry Taborek) (01/13/89)

From article <310@sagpd1.UUCP>, by banderso@sagpd1.UUCP (Bruce Anderson):

>My second question is: how much effect does changing the block and
>fragment size have? The manual says that if you use an 8K block and
>fragment size it speeds up the file system but wastes space. Does
>anyone have a quantitative feel of how much the tradeoff is?

I kept some old copies of the 4.2BSD documentation from my old job,
and Volume 2 has a section on the 4.2BSD file system (A FAST FILE
SYSTEM FOR UNIX, by McKusick, Joy, Leffler, and Fabry).  From it I
have selected the following:

Space used	% Waste		Organization
 775.2 Mb	  0.0		Data only, no separation
 807.8 Mb	  4.2		Data only, 512-byte boundary
 828.7 Mb	  6.9		512-byte block
 866.5 Mb	 11.8		1024-byte block
 948.5 Mb	 22.4		2048-byte block
1128.3 Mb	 45.6		4096-byte block

It also states:

"The space overhead in the 4096 (byte block) / 1024 (byte fragment) new
file system organization is empirically observed to be about the same 
as in the 1024 byte old file system organization."  ... "The net
result is about the same disk utilization when the new file
system's fragment size equals the old file system's block size."

Thus, given your fragment size, you can compare it to the table
above to determine your amount of wasted space.  You can also
determine whether you have 2, 4, or 8 fragments per block, but I
believe that 4 is about right.  Too high a fragment-to-block ratio
(8), and the data from fragments may have to be copied up to 7 times
to rebuild into a block (this would happen when a file grows
beyond the size that 7 fragments could hold, and the file system
copies those fragments into a block).  Too low a fragment-to-block
ratio (2), and the block/fragment concept isn't helping
very much.
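
To put rough numbers on this, here is a small back-of-the-envelope
program (just an illustration, not from the paper) that computes how
much space a file of a given size occupies under a few block/fragment
combinations: every block but the last is a full block, and the tail
is rounded up to whole fragments.  Indirect blocks and inode overhead
are ignored, so treat the results as approximations.

    /* frag-waste.c -- rough space accounting for a BSD-style
     * block/fragment file system.
     */
    #include <stdio.h>

    static long
    space_used(long filesize, long bsize, long fsize)
    {
        long full  = filesize / bsize;             /* whole blocks        */
        long tail  = filesize % bsize;             /* bytes in last block */
        long frags = (tail + fsize - 1) / fsize;   /* fragments for tail  */

        return full * bsize + frags * fsize;
    }

    int main()
    {
        long sizes[] = { 100, 1000, 5000, 20000 };
        int i;

        for (i = 0; i < 4; i++) {
            long n = sizes[i];
            printf("%6ld bytes: 8k/8k uses %6ld, 8k/1k uses %6ld, "
                   "4k/512 uses %6ld\n",
                   n,
                   space_used(n, 8192, 8192),
                   space_used(n, 8192, 1024),
                   space_used(n, 4096, 512));
        }
        return 0;
    }

Running it shows the pattern the paper describes: an 8K block with an
8K fragment charges a small file a full 8K, while the same block size
with a 1K fragment wastes about what a 1K-block old file system did.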

They also give a table that seems to show that there is not all that
much difference in speed between a 4K block file system and an 8K
block file system.  Instead, they state that the biggest factor in
keeping things fast is keeping at least 10% of the partition free.

>When allocating inodes, what kind of ratio of disk space to inodes
>do people use? The default on the system is an inode for every 2KB
>of disk space in a file system but this seemed like an awfully high
>number of inodes. Is it?

It depends.  The default ratio of inodes to file system size is
meant as a good rule of thumb for most partitions.  If you plan on
holding usenet news on a partition (lots of small files), then you
may wish to lower this to one inode for every 1KB of disk space in
the file system.  On the other hand, if a few very large files fill
the partition, then you may wish to raise the parameter to one inode
for every 8KB of disk space.  When you look at this sort of problem,
you begin to understand why there are partitions, and what purpose
they serve.
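
To see what those ratios mean in absolute numbers, here is a quick
approximation (my own, and a simplification: the real newfs rounds
the count to fit cylinder group boundaries, so treat it as an
estimate).  The 300 Mb partition size is just an example value.

    /* inode-count.c -- roughly how many inodes you get for a given
     * partition size and bytes-per-inode ratio.
     */
    #include <stdio.h>

    int main()
    {
        long part_mb = 300;                     /* example partition size */
        long ratios[] = { 1024, 2048, 8192 };   /* bytes of data per inode */
        int i;

        for (i = 0; i < 3; i++)
            printf("1 inode per %5ld bytes -> about %ld inodes\n",
                   ratios[i], (part_mb * 1024L * 1024L) / ratios[i]);
        return 0;
    }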

>This is probably HP specific but if you define multiple swap sections,
>does it fill up the first before starting on the secondary ones or
>does it use all in a balanced manner?  If the first then obviously
>the primary swap space should be on the fastest drive but otherwise
>it doesn't matter.

What I noticed on BSD systems I used to administer was that the
SECOND swap area was used exclusively until it filled, and then
the swap overflow went to the first.  To me this made sense, as the
second swap area was on our second physical disk, which generally
has less I/O than the first physical disk is expected to have.  (Any
comments on this are appreciated.)
-- 
Larry Taborek	..grebyn!macom1!larry		Centel Federal Systems
		larry@macom1.UUCP		11400 Commerce Park Drive
						Reston, VA 22091-1506
						703-758-7000

chris@mimsy.UUCP (Chris Torek) (01/14/89)

In article <4787@macom1.UUCP> larry@macom1.UUCP (Larry Taborek) writes:
>... You can also determine whether you have 2, 4, or 8 fragments per
>block, but I believe that 4 is about right.  Too high a
>fragment-to-block ratio (8), and the data from fragments may have to
>be copied up to 7 times to rebuild into a block (this would happen
>when a file grows beyond the size that 7 fragments could hold, and
>the file system copies those fragments into a block).  Too low a
>fragment-to-block ratio (2), and the block/fragment concept isn't
>helping very much.

This is not quite how things work; and there is not too much reason to
worry about fragment expansion in 4.3BSD.  (It *is* a problem in 4.2BSD
if you use `vi', for instance, although just how much so varies.)

Only the last part of a file ever occupies a fragment.  When extending
a file, the kernel decides whether it needs a full block or whether a
fragment will suffice.  If a fragment will do, the kernel looks for an
existing block (in the right cg) that is already appropriately
fragmented.  If one exists and has sufficient space, it is used;
otherwise the kernel allocates a full block and carves it up.
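
In outline, that decision looks something like the following toy
model (an illustration only, not the actual 4.3BSD allocator; the
free_frags_available flag stands in for the real best-fit search
within the cylinder group):

    /* Toy model: when a file is extended, does the tail need a full
     * block or will a fragment do, and if a fragment, do we reuse an
     * already-fragmented block or carve up a fresh one?
     */
    #include <stdio.h>

    #define BSIZE 4096
    #define FSIZE 1024

    static void
    alloc_tail(long tail_bytes, int free_frags_available)
    {
        long frags = (tail_bytes + FSIZE - 1) / FSIZE;

        if (frags * FSIZE >= BSIZE)
            printf("%5ld bytes in tail: allocate a full %d-byte block\n",
                   tail_bytes, BSIZE);
        else if (free_frags_available)
            printf("%5ld bytes in tail: use %ld fragment(s) from an "
                   "already-fragmented block\n", tail_bytes, frags);
        else
            printf("%5ld bytes in tail: allocate a full block and carve "
                   "off %ld fragment(s)\n", tail_bytes, frags);
    }

    int main()
    {
        alloc_tail(300, 1);
        alloc_tail(2500, 0);
        alloc_tail(4000, 1);
        return 0;
    }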

In 4.3BSD, Kirk added an `optimisation' flag (space/time; tunefs -o)
which is normally set to `time'.  The kernel automatically switches it
to `space' if the file system becomes alarmingly fragmented, then back
to `time' when things are cleaned up.  This flag does not exist in
4.2BSD; in essence, 4.2 always chooses `space'.

Now, when expanding a file that already ends in a fragment to a new
size that can be a fragment, if the flag is set to `space', the kernel
uses the usual best-fit search.  But if the flag is set to `time', the
kernel finds a fragment that can be expanded in place to a full block,
or takes a full block if no such fragments exist.
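
The effect of the flag on that last-fragment case can be summarised
as a toy decision table (again only an illustration of the paragraph
above, not the real kernel code; the FS_OPTSPACE/FS_OPTTIME names
are borrowed purely as labels for the two policies):

    #include <stdio.h>

    enum optim { FS_OPTSPACE, FS_OPTTIME };

    static const char *
    expand_fragment(enum optim policy, int in_place_spot_available)
    {
        if (policy == FS_OPTSPACE)
            return "best-fit the enlarged fragment anywhere (may copy data)";
        if (in_place_spot_available)
            return "put it where it can keep expanding in place to a block";
        return "take a full block now so later growth never copies";
    }

    int main()
    {
        printf("space policy:              %s\n",
               expand_fragment(FS_OPTSPACE, 0));
        printf("time policy, spot exists:  %s\n",
               expand_fragment(FS_OPTTIME, 1));
        printf("time policy, no such spot: %s\n",
               expand_fragment(FS_OPTTIME, 0));
        return 0;
    }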

All of this affects only poorly-behaved programs that write files a
little bit at a time.  In 4.2BSD, vi always wrote 1024 bytes, which in
a 4k/1k file system is as bad as possible.  It was possible for every
write system call to have to allocate a new set of fragments, copying
the data from the old fragments to the new.  In 4.3BSD, even such
programs only lose once per fragment expansion, because the next three
(in a 4:1 FS) can always be done in place (provided that fs->fs_optim
is FS_OPTTIME).  vi was fixed in 4.3BSD to write statb.st_blksize blocks.
(And enbugged at the same time: if st_blksize is greater than the
MAXBSIZE with which vi was compiled, it scribbles over some of its
own variables.  I keep telling them that compiling in MAXBSIZE is
wrong....  Yes, it *does* break, if you speak NFS with a Pyramid
for instance.)
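
The fix itself is easy to sketch (an illustration, not the actual vi
source): discover the file system's preferred I/O size at run time
with fstat(2) and buffer writes in that unit, instead of compiling in
a constant such as MAXBSIZE.  This just copies standard input to the
named file.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        struct stat sb;
        char *buf;
        ssize_t n;
        int fd;

        if (argc != 2) {
            fprintf(stderr, "usage: blkwrite outfile\n");
            return 1;
        }
        if ((fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0666)) < 0) {
            perror(argv[1]);
            return 1;
        }
        if (fstat(fd, &sb) < 0 ||
            (buf = malloc((size_t) sb.st_blksize)) == NULL) {
            perror("fstat/malloc");
            return 1;
        }
        /* write in the file system's preferred unit, whatever it is */
        while ((n = read(0, buf, (size_t) sb.st_blksize)) > 0)
            write(fd, buf, (size_t) n);
        free(buf);
        close(fd);
        return 0;
    }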

[and on paging:]
>What I noticed on BSD systems I used to administer was that the 
>SECOND swap area was used exclusively until it filled, and then
>the swap overflow went to the first.  To me, this made sense as the
>second swap area was on our second physical disk, which generally
>has less i/o then the first physical disk is expeced to have.  (Any
>comments to this are appreciated).

No:  Swap space is created in dmmax-sized segments scattered evenly
across all paging devices; its allocation approximates a uniform random
distribution.  (See swfree() in /sys/sys/vm_sw.c and swpexpand() in
/sys/sys/vm_drum.c.)
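
The interleave is easy to picture with a toy model (illustration
only; the real code lives in the files named above): swap addresses
are handed out in dmmax-sized chunks that rotate across the
configured paging devices, so no single drive sees all of the paging
traffic.

    #include <stdio.h>

    #define NSWDEV 2        /* number of paging devices (example) */
    #define DMMAX  2048     /* chunk size in 512-byte sectors     */

    int main()
    {
        long chunk;

        /* map the first few swap-address chunks onto devices */
        for (chunk = 0; chunk < 8; chunk++)
            printf("swap chunk %2ld (sectors %6ld-%6ld) -> device %ld\n",
                   chunk, chunk * DMMAX, (chunk + 1) * DMMAX - 1,
                   chunk % NSWDEV);
        return 0;
    }
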
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris