[comp.unix.wizards] System V file systems

chris@mimsy.UUCP (Chris Torek) (10/26/88)

In article <8332@alice.UUCP> debra@alice.UUCP (Paul De Bra) writes:
>Given a fast CPU, a not-very-intelligent disk controller and the optimal
>interleaving and file system gapsize, the performance is roughly linearly
>proportional to the block-size. ...

Block size is a large, and probably the largest, factor in actual I/O
performance on real Unix machines.  The BSD Fast File System's cylinder
group arrangement does have a non-negligible effect on at least one
thing, however: backups speed up by more than the ratio of block sizes
when switching from a V7/4.1BSD/SysV style file system.  Faster seeks
make the effect of cylinder groups less dramatic, but we still have
a number of old washtub drives in service (they will stay until they
fail, since none are under a service contract anymore and fixing them
would be expensive).

>The main reason why block-size is the limiting factor is that both the
>OS and the disk-controller have only slightly more work handling an 8k
>block than a 1k block. So you don't hit the hardware speed-limit as soon
>with larger block-sizes.

It would help if the disk drivers were cleverer and coalesced adjacent
block requests and/or read whole tracks at a time.  (A Fuji Eagle's
48-sector track holds exactly three 4.3BSD 8K blocks, and with `-a 3'
files might be contiguous across one track often enough to make this
worthwhile.)
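
A toy illustration of the coalescing idea (the request queue and sector
numbers below are made up; only the 48-sector, 24 KB track comes from the
Eagle figure above):

    /*
     * Given pending requests sorted by starting sector, fold physically
     * adjacent ones into a single transfer.  Here three adjacent 8K
     * (16-sector) requests become one track-sized (24 KB) transfer.
     */
    #include <stdio.h>

    struct req { long sector, nsect; };

    int
    main(void)
    {
        struct req q[] = { {0, 16}, {16, 16}, {32, 16}, {400, 16} };
        int n = sizeof q / sizeof q[0], i, j;

        for (i = 0; i < n; i = j) {
            long start = q[i].sector, len = q[i].nsect;

            for (j = i + 1; j < n && q[j].sector == start + len; j++)
                len += q[j].nsect;      /* adjacent: merge into this I/O */
            printf("one transfer: sectors %ld-%ld (%ld bytes)\n",
                start, start + len - 1, len * 512);
        }
        return 0;
    }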
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

hedrick@geneva.rutgers.edu (Charles Hedrick) (10/26/88)

>Given a fast CPU, a not-very-intelligent disk controller and the optimal
>interleaving and file system gapsize, the performance is roughly linearly
>proportional to the block-size. 

Two problems: 

(1) To get really good performance, you have to use a block size so
large that you waste lots of disk space on small files.  The BSD file
system can split the last block of a file into fragments, so you get
the benefit of a big block size on all but the last block without
wasting the disk space (a back-of-the-envelope illustration follows
point (2) below).

(2) System V (at least SVr2, and I think also SVr3) uses a free list,
which it does not keep in order, so an active file system fragments
very soon.  The BSD file system is designed to avoid fragmentation.
Of course this problem will not show if you do your tests right after
creating the file system.
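
Back-of-the-envelope numbers for point (1), using a made-up population of
small files and the common 8K-block/1K-fragment combination (nothing here
is measured, it is just arithmetic):

    /*
     * 1000 small files of about 1 KB each: with whole 8K blocks every
     * file occupies a full block; with 1K fragments each occupies one
     * fragment.
     */
    #include <stdio.h>

    int
    main(void)
    {
        long nfiles = 1000;
        long bsize = 8192, fsize = 1024;    /* block and fragment sizes */

        printf("whole blocks only: %ld KB\n", nfiles * bsize / 1024);
        printf("with fragments:    %ld KB\n", nfiles * fsize / 1024);
        return 0;
    }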

The BSD fast file system is more than just larger block sizes.  It
makes sense for SVr4 to support both the old and the new file systems, to
allow people to move between SVr3 and SVr4 during testing and conversion.
However, once you have committed to SVr4, I'd think you would want to move to
the fast file system.

henry@utzoo.uucp (Henry Spencer) (10/28/88)

In article <Oct.25.22.42.50.1988.1890@geneva.rutgers.edu> hedrick@geneva.rutgers.edu (Charles Hedrick) writes:
>(2) System V (at least SVr2, and I think also SVr3) uses a free list,
>which it does not keep in order, so an active file system fragments
>very soon.  The BSD file system is designed to avoid fragmentation.
>Of course this problem will not show if you do your tests right after
>creating the file system.

Or if you run your tests in a time-sharing environment, where the disk
heads are always on their way to somewhere else anyway.  If you read
the fine print, all the Berkeley performance tests were run single-user!!
We conjectured a long time ago that the only feature of the 4.2 filesystem
that matters much in a timesharing environment is the big block size; I
haven't yet seen any solid results (numbers, not anecdotes) that would
contradict this.
-- 
The dream *IS* alive...         |    Henry Spencer at U of Toronto Zoology
but not at NASA.                |uunet!attcan!utzoo!henry henry@zoo.toronto.edu

bostic@ucbvax.BERKELEY.EDU (Keith Bostic) (10/29/88)

In article <1988Oct27.173247.2789@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) writes:
> Or if you run your tests in a time-sharing environment, where the disk
> heads are always on their way to somewhere else anyway.

This depends solely on your job mix; I recently saw figures someone derived
from trying to decide how best to queue requests for a new disk driver.  The
sampled system normally showed no head movement between the original request
and subsequent/read-ahead requests.  If you have a system with an
overloaded/limited number of disks, your paradigm is much more likely to be
correct.

> We conjectured a long time ago that the only feature of the 4.2 filesystem
> that matters much in a timesharing environment is the big block size; I
> haven't yet seen any solid results (numbers, not anecdotes) that would
> contradict this.

Given the nebulousness of the word "timesharing", I suspect you never will.

--keith

chris@mimsy.UUCP (Chris Torek) (10/29/88)

In article <Oct.25.22.42.50.1988.1890@geneva.rutgers.edu>
hedrick@geneva.rutgers.edu (Charles Hedrick) notes that
>>... The BSD file system is designed to avoid fragmentation [of the
>>free list, eventually resulting in blocks being allocated `at random'].
>>Of course this problem will not show if you do your tests right after
>>creating the file system.

In article <1988Oct27.173247.2789@utzoo.uucp> henry@utzoo.uucp
(Henry Spencer) writes:
>Or if you run your tests in a time-sharing environment, where the disk
>heads are always on their way to somewhere else anyway.  If you read
>the fine print, all the Berkeley performance tests were run single-user!!
>We conjectured a long time ago that the only feature of the 4.2 filesystem
>that matters much in a timesharing environment is the big block size; I
>haven't yet seen any solid results (numbers, not anecdotes) that would
>contradict this.

I actually agree with this (in spite of the point others have noted as
to the weak definition of `time shared').  But it is important to
consider several things.  Not the least is that workstations (e.g.,
Suns) are virtually single-user.  Certainly there are servers and
daemons running, and you may be multiprocessing, but `on the average'
you tend not to have more than one process doing file system I/O.
Second, read-ahead blocks are moved into the buffer cache at the same
time as the block actually being read; if these are adjacent, the
read-ahead block will come in more or less immediately, even if the disk then has
to move the heads elsewhere for someone else's page-outs.  So you get
`two for the price of one', as it were.  Also---and to us, this is not
the least important point---often, when the machine really *is* in
single user mode, you will want file reading to be as fast as possible,
so that your backups will finish soon and you can allow the next batch
of news to flow in.  The BSD FFS allocation policies do a fair job of
keeping files straight even in the presence of multiprocessed/timeshared
writes.
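
A user-space toy of the read-ahead point, just to make the `two for the
price of one' effect concrete (the tiny cache and all the names here are
invented for the illustration; they are not the 4.3BSD routines):

    #include <stdio.h>

    #define NBLK 16

    static int incache[NBLK];          /* 1 = block already in memory */
    static int lastblk = -2;

    static void
    fetch(int blk)                     /* stands in for a real disk read */
    {
        printf("disk read: block %d\n", blk);
        incache[blk] = 1;
    }

    static void
    read_block(int blk)
    {
        if (incache[blk])
            printf("cache hit:  block %d\n", blk);
        else
            fetch(blk);
        if (blk == lastblk + 1 && blk + 1 < NBLK && !incache[blk + 1])
            fetch(blk + 1);            /* sequential: read one block ahead */
        lastblk = blk;
    }

    int
    main(void)
    {
        int i;

        for (i = 0; i < 6; i++)        /* a sequential scan of a file */
            read_block(i);
        return 0;
    }

After the first couple of blocks, every request is a cache hit because the
read-ahead got there first.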

In other words, while I think that the large blocks are the most
important factor, I am not unhappy about all the rest of it.  (After
all, *I* did not have to write the code . . . :-) ---and I have not had
to do much to `maintain' it, either; Kirk did a good job of the code
[think that is a hint, Mike? :-) ])
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

jfh@rpp386.Dallas.TX.US (The Beach Bum) (10/30/88)

In article <26599@ucbvax.BERKELEY.EDU> bostic@ucbvax.BERKELEY.EDU (Keith Bostic) writes:
>In article <1988Oct27.173247.2789@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) writes:
>> Or if you run your tests in a time-sharing environment, where the disk
>> heads are always on their way to somewhere else anyway.
>
>                                     If you have a system with an
>overloaded/limited number of disks, your paradigm is much more likely to be
>correct.

In the real world, where more than one process is accessing the disks at
any given time, the heads are always in the wrong place.

If you localize all of the file information for a given file, as the Berkeley
Fast File System does, you only need to access more than one file to break it.

I have never seen a realistic benchmark [ multi-process, multi-file, random
access ] validate the claims BSD FFS puts forward - except to the extent that
having the larger block size dictates.  And soon USG Unix will have 2K blocks
so expect that advantage to diminish.
-- 
John F. Haugh II                        +----Make believe quote of the week----
VoiceNet: (214) 250-3311   Data: -6272  | Nancy Reagan on Richard Stallman:
InterNet: jfh@rpp386.Dallas.TX.US       |          "Just say `Gno'"
UucpNet : <backbone>!killer!rpp386!jfh  +--------------------------------------

mash@mips.COM (John Mashey) (10/30/88)

In article <8338@rpp386.Dallas.TX.US> jfh@rpp386.Dallas.TX.US (The Beach Bum) writes:
.....
>I have never seen a realistic benchmark [ multi-process, multi-file, random
>access ] validate the claims BSD FFS puts forward - except to the extent that
>having the larger block size dictates.  And soon USG Unix will have 2K blocks
>so expect that advantage to diminish.

I don't have the benchmark either.  I do note that when we brought up
V.3 on our systems, we started with a vanilla port, intending to put
the FFS in later.  We did (8K blocks).  Overall performance, responsiveness, etc.,
in a multi-user environment went way up.  On 5-10mips machines,
the vanilla 1K block SYS V file system was tremendously disk-bound.
(Again, I don't have the numbers handy, but I remember what it felt like.)
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

friedl@vsi.COM (Stephen J. Friedl) (10/31/88)

In article <8338@rpp386.Dallas.TX.US>, jfh@rpp386.Dallas.TX.US (The Beach Bum) writes:
> I have never seen a realistic benchmark [ multi-process, multi-file, random
> access ] validate the claims BSD FFS puts forward - except to the extent that
> having the larger block size dictates.  And soon USG Unix will have 2K blocks
> so expect that advantage to diminish.

These are available now.  System V Release 3.1.1 for the 3B15 has
had 2k blocks for some time, and Sys V Rel 3.2.1 for the 3B2 just
came out with it.

How hard is it for an instantiation of UNIX to support multiple
kinds of blocksizes?  I would think that keeping the blocksize in
the superblock would make it pretty easy, so I could use 1k blocks
for root, and (say) 8k for the /database partition with a dozen
files all > 1MB.  Currently it seems like a big deal for them
to come out with a new supported blocksize.

     Steve
-- 
Steve Friedl    V-Systems, Inc.  +1 714 545 6442    3B2-kind-of-guy
friedl@vsi.com     {backbones}!vsi.com!friedl    attmail!vsi!friedl
----Nancy Reagan on 120MB SCSI cartridge tape: "Just say *now*"----

djg@sequent.UUCP (Derek Godfrey) (11/01/88)

> How hard is it for an instantiation of UNIX to support multiple
> kinds of blocksizes?  I would think that keeping the blocksize in
> the superblock would make it pretty easy, so I could use 1k blocks
> for root, and (say) 8k for the /database partition with a dozen
> files all > 1MB.  Currently it seems like a big deal for them
> to come out with a new supported blocksize.
> 

Not difficult at all, since the block size field of the super block is
32 bits wide.  The code needed in fs/s5 is minimal - just changing a
few case statements to an algorithm! (assuming you're willing to increase
the size of a system buffer)  The biggest effort, however, is
converting all the utilities that are still using BSIZE rather than
FsBSIZE.
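
Roughly the shape of the utility problem, with invented names (this is not
the actual fs/s5 or BSIZE/FsBSIZE code, just an illustration of compile-time
versus per-file-system sizes):

    #include <stdio.h>

    #define BSIZE 1024                  /* old style: baked in at compile time */

    struct superblock {
        long s_bsize;                   /* new style: read from the file system */
        /* ... free list, inode counts, and so on ... */
    };

    #define FS_BSIZE(sb) ((sb)->s_bsize)

    int
    main(void)
    {
        struct superblock sb;

        sb.s_bsize = 2048;              /* e.g. a 2K-block file system */
        printf("compile-time size: %d\n", BSIZE);
        printf("per-fs size:       %ld\n", FS_BSIZE(&sb));
        return 0;
    }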
	

guy@auspex.UUCP (Guy Harris) (11/01/88)

>How hard is it for an instantiation of UNIX to support multiple
>kinds of blocksizes?

If you have the right buffer cache mechanism (or moral equivalent; cf.
SunOS 4.0, which uses the buffer cache only for control information,
using the paging system for data caching), it's not that hard.

>I would think that keeping the blocksize in the superblock would make
>it pretty easy, so I could use 1k blocks for root, and (say) 8k for
>the /database partition with a dozen files all > 1MB.

That's basically what the BSD file system does.

>Currently it seems like a big deal for them to come out with a new
>supported blocksize.

That's because they *don't* have the right buffer cache mechanism, and
have to hack in a new buffer cache for 2KB file systems (although at
least both buffer caches are sort of subclasses of a more general
"buffer cache" class, so they do get to share some code).

With any luck, S5R4 will have the right buffer cache mechanism, namely
the BSD one (i.e., with any luck, they'll put the V7/S5 file system on
top of it, rather than having *both* the BSD *and* the V7/S5 buffer
cache to support the two different file systems), or moral equivalent
(cf. SunOS 4.0, whose VM subsystem will be in S5R4 - which will, like
SunOS 4.0, use it for data caching).

crossgl@ingr.UUCP (Gordon Cross) (11/01/88)

In article <6413@daver.UUCP>, dlr@daver.UUCP (Dave Rand) writes:
> Why is the System V.[23] file system _SO_ much slower than the
> BSD file system? On several systems I have, the disk performance
> seems dreadful. On one system, it is around 20K bytes per second.
> The best I have seen from System V is 200K per second - but the
> actual disk controller is capable of 1.5 megabytes per second!

The problem is with the file system organization itself, so you will not be
able to "fix" it.  Any further explanation here would be far too lengthy, but
I can direct you to an informative article on the subject:


          A Fast File System for UNIX

          Marshall Kirk McKusick, William N. Joy,
          Samuel J. Leffler, and Robert S. Fabry

          Computer Systems Research Group
          Computer Science Division
          Department of Electrical Engineering and Computer Science
          University of California, Berkeley
          Berkeley, CA 94720


Gordon Cross
Intergraph Corp.  Huntsville, AL

sl@van-bc.UUCP (pri=-10 Stuart Lynne) (11/01/88)

In article <917@vsi.COM> friedl@vsi.COM (Stephen J. Friedl) writes:
>In article <8338@rpp386.Dallas.TX.US>, jfh@rpp386.Dallas.TX.US (The Beach Bum) writes:
}> I have never seen a realistic benchmark [ multi-process, multi-file, random
}> access ] validate the claims BSD FFS puts forward - except to the extent that
}> having the larger block size dictates.  And soon USG Unix will have 2K blocks
}> so expect that advantage to diminish.
}
}These are available now.  System V Release 3.1.1 for the 3B15 has
}had 2k blocks for some time, and Sys V Rel 3.2.1 for the 3B2 just
}came out with it.
}

My obsolete Callan Unistar running Unisoft 5.0 (a *very* early variant of
System V, possibly about release 0 or -1) with vintage binaries from 1983/1984
supports 1-, 2- and 4-sector block file systems (that's .5/1/2 KB).

I would suggest that various releases of System V have supported 2 KB blocks
for as long as there has been a System V.  It just seems to be up to the
porting house whether they thought it was needed for a particular machine and
worth using.

In the case of the Callan they provided the 2kb support for use with SMD
drives (although they will work on other drives as well). 

Unfortunately, they shipped the system generating .5 KB blocks for all file
systems by default; you have to gen your own to use either 1 KB or 2 KB.
To make things worse, the boot ROM only knows about the .5 KB blocks, so you
are stuck with that for your root partition (which is a fixed size, too).

At least on a slow 68010 with mediocre drives, the difference between 1 KB
and 2 KB blocks is not that great (although both were a big improvement over
.5 KB).  I use 1 KB to help minimize the impact of the block buffers on my
2 MB of RAM.
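
The arithmetic behind that choice, with a made-up buffer count (only the
2 MB figure and the block sizes come from the paragraph above):

    #include <stdio.h>

    int
    main(void)
    {
        long nbuf = 200;                    /* hypothetical buffer count */
        long ram = 2L * 1024 * 1024;        /* 2 MB of RAM */
        long sizes[] = { 512, 1024, 2048 };
        int i;

        for (i = 0; i < 3; i++)
            printf("%4ld-byte buffers: %3ld KB (%ld%% of RAM)\n",
                sizes[i], nbuf * sizes[i] / 1024,
                nbuf * sizes[i] * 100 / ram);
        return 0;
    }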


-- 
Stuart.Lynne@wimsey.bc.ca {ubc-cs,uunet}!van-bc!sl     Vancouver,BC,604-937-7532

dave@celerity.UUCP (David L. Smith) (11/01/88)

In article <917@vsi.COM> friedl@vsi.COM (Stephen J. Friedl) writes:
>How hard is it for an instantiation of UNIX to support multiple
>kinds of blocksizes?  I would think that keeping the blocksize in
>the superblock would make it pretty easy, so I could use 1k blocks
>for root, and (say) 8k for the /database partition with a dozen
>files all > 1MB.  Currently it seems like a big deal for them
>to come out with a new supported blocksize.

We support multiple block sizes on our new toy (the Model 500), ranging
from 4K to 256K (for big striped disks).  It was relatively straightforward,
except for some problems while doing development with several of the system
utilities that depend on the system blocksize for the size of their internal
buffers.  We also ferreted out quite a few "magic" blocksize numbers.

dwc@homxc.UUCP (Malaclypse the Elder) (11/05/88)

In article <917@vsi.COM>, friedl@vsi.COM (Stephen J. Friedl) writes:
> 
> How hard is it for an instantiation of UNIX to support multiple
> kinds of blocksizes?  I would think that keeping the blocksize in
> the superblock would make it pretty easy, so I could use 1k blocks
> for root, and (say) 8k for the /database partition with a dozen
> files all > 1MB.  Currently it seems like a big deal for them
> to come out with a new supported blocksize.
> 
the difficulty that i see is in the maintenance of the
buffer cache.  do you have separate buffer caches for
each size, or somehow share them?  then there is the modification
of file system maintenance programs (have you looked at fsck
source lately?).  not such a big deal, but a consideration.

danny chen
att!homxc!dwc

guy@auspex.UUCP (Guy Harris) (11/06/88)

 >> How hard is it for an instantiation of UNIX to support multiple
 >> kinds of blocksizes? ...,

 >the difficulty that i see is in the maintenance of the
 >buffer cache.  do you have separate buffer caches for
 >each size or somehow share them.

Share them.  See 4.[23]BSD.
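
A crude sketch of what sharing means: each buffer carries its own size
instead of assuming a global BSIZE, so one pool can serve file systems with
different block sizes.  The names and structure are invented; the real
4.[23]BSD buffer code does considerably more (hashing, LRU reuse, resizing
buffers in place):

    #include <stdio.h>
    #include <stdlib.h>

    struct buf {
        int   b_dev;                    /* which file system */
        long  b_blkno;                  /* block number within it */
        long  b_bcount;                 /* bytes held by this buffer */
        char *b_data;
    };

    static struct buf *
    getblk(int dev, long blkno, long size)
    {
        /* a real cache would search for an existing buffer first */
        struct buf *bp = malloc(sizeof *bp);

        bp->b_dev = dev;
        bp->b_blkno = blkno;
        bp->b_bcount = size;            /* a per-buffer property ... */
        bp->b_data = malloc(size);      /* ... not a kernel-wide constant */
        return bp;
    }

    int
    main(void)
    {
        struct buf *a = getblk(0, 10, 1024);    /* 1K root file system */
        struct buf *b = getblk(1, 10, 8192);    /* 8K database partition */

        printf("buffer a holds %ld bytes, buffer b holds %ld bytes\n",
            a->b_bcount, b->b_bcount);
        return 0;
    }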