[comp.os.misc] Contiguous files; extent based file systems

chris@mimsy.UUCP (Chris Torek) (12/17/87)

Personally, I have never cared whether my files were contiguous.  All
I care about is that they be reasonably fast to access.

Completely contiguous files are hard to build.  You must either
declare their size initially or else constantly rearrange disk
blocks.  Contiguous files tend to be fast to access, but if your
OS has a heavy multiuser load, it will often be using several files
at once, destroying the effect created by the contiguous files.

Extent based file systems are usually implemented with a small
maximum number of extents.  This means that the file size must
(implicitly, by extent size) be declared in advance, if the file
may be large.  Many contiguous file allocators actually work with
extents, since this allows a contiguous file to be grown in place.
Many extent based file systems force user code to work with extents,
rather than with bytes; this has a tendency to make programs
inflexible (they deal either with `files' or with `terminals' but
not both).  Some systems solve this fairly well with I/O libraries.
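
A minimal sketch of that library trick, in C: buffer one record and dole
bytes out of it.  The record_read() primitive and RECSIZE here are
hypothetical, not any particular system's interface.

    /* Present a byte stream on top of a record/extent oriented file.
     * record_read(fd, recno, buf) is a hypothetical primitive that
     * fills buf with one RECSIZE-byte record, returning 0 or -1.
     */
    #define RECSIZE 512

    extern int record_read(int fd, long recno, char *buf);

    static char rbuf[RECSIZE];
    static long currec = -1;

    /* Return the byte at offset pos, or -1 on error. */
    int byte_read(int fd, long pos)
    {
        long recno = pos / RECSIZE;

        if (recno != currec) {          /* record not buffered yet */
            if (record_read(fd, recno, rbuf) < 0)
                return -1;
            currec = recno;
        }
        return rbuf[pos % RECSIZE] & 0377;
    }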

The Unix file interface does not prevent a system from doing
contiguous or otherwise well-scheduled allocation, although it does
make it difficult to declare a file's size in advance.  I consider
this a feature, not a bug: very often I have no idea how big the
file will be.  This facilitates using filters and pipes, where the
input file size is often unavailable.  The 4.2/4.3BSD block allocator
does a fairly good job of arranging the disk, although it uses more
compute power than one would like on, e.g., a Vax 11/750.  Other
allocators exist that also do a good job; the PJW file system in
Research Unix is supposed to work well (I have no firsthand
experience).  All of these manage without leaking the internal
disk organisation out to user programs, which means said organisation
can also be changed relatively easily.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

gupta@cullsj.UUCP (Yogesh Gupta) (12/18/87)

In article <9828@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
> Personally, I have never cared whether my files were contiguous.  All
> I care about is that they be reasonably fast to access.
> 

I definitely agree.  However, I find that if I create a 100MB file
under Unix (BSD 4.2, System V Rel 1), the overhead in
randomly accessing various parts of it is too high (due to indirect
inode structures).  Any comments?  What if I know that my file will
be of the order of 100MB and then declare an extent size of 1MB (wastage
of 0.5%, on the average) in an extent based file system?
-- 
Yogesh Gupta			| If you think my company will let me
Cullinet Software, Inc.		| speak for them, you must be joking.

jlc@didsgn.UUCP (jlc) (12/18/87)

In article <9828@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
> Completely contiguous files are hard to build.  You must either
> declare their size initially or else constantly rearrange disk
> blocks.  Contiguous files tend to be fast to access, but if your
> OS has a heavy multiuser load, it will often be using several files
> at once, destroying the effect created by the contiguous files.
> 
> Extent based file systems are usually implemented with a small
> maximum number of extents.  This means that the file size must
> (implicitly, by extent size) be declared in advance, if the file
> may be large.


If you are interested in contiguous files, here is an example of how it is
done on UN/V, which is a real version of System V designed by Charles River
Data Systems.  I have personally used this implementation for real-time and
image processing software, and it is very fast.  (All the following is
copyrighted by Charles River Data Systems, etc. etc.)


The following is extracted from the open() subroutine manual entry:

          NAME

          open()         opens a file for reading or writing

          FORMAT

          #include <fcntl.h>

          int open (path, oflag [, omodes [ , minext, maxext ] ])
          char *path;
          int oflag;
          int omodes, minext, maxext;

          DESCRIPTION

          open() opens a file descriptor  for  the named file and sets
          the file status flags according  to  the  value  of  oflag.
          path points to a pathname naming a file.   oflag values are
          composed  by  ORing  values  from the following,  which  are
          declared in <fcntl.h> (only one from the first three):

          O_RDONLY       Open for reading only.

          O_WRONLY       Open for writing only.

          O_RDWR         Open for reading and writing.

{ some verbiage cut here }

          O_CONTIG       Force a  contiguous  file  (a  file  which is
                         allocated disk space in large  extents).   If
                         the file  is  truncated  or  created  by this
                         open, it becomes  a  contiguous  file with an
                         initial  allocated  extent   of   extent_size
                         bytes.  Otherwise, the file must  already  be
                         contiguous   and   have   extent_size   bytes
                         available  for  growth  (a  new   extent   is
                         allocated if  necessary)  or  the  open  will
                         fail.    If   O_CONTIG  is   not   specified,
                         contiguous files can be  opened just like any
                         other file.

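For what it's worth, a call using the optional arguments shown in FORMAT
might look roughly like this.  This is only a sketch: the file name, the
1MB extent values, and the use of O_CREAT alongside O_CONTIG are example
assumptions, not recommendations.

    #include <fcntl.h>
    #include <stdio.h>

    int make_contig()
    {
        /* Create a contiguous file with (for example) 1MB extents.
         * minext and maxext follow the FORMAT section above; O_CREAT
         * is assumed to be supported in the usual way.
         */
        int fd = open("bigfile", O_RDWR | O_CREAT | O_CONTIG, 0644,
                      1024 * 1024,      /* minext */
                      1024 * 1024);     /* maxext */

        if (fd < 0)     /* e.g. no suitable contiguous extent is free */
            perror("open");
        return fd;
    }
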
The following is extracted out of the command manual:

          contig              UNOS COMMANDS MANUAL              contig

          NAME

          contig         make a contiguous file

          FORMAT

          contig    [extent=range]     [alloc=rangelist]     [-freeze]
          [-truncate] file ...

          DESCRIPTION

          contig makes a contiguous file (a file which owns disk space
          in large extents rather than in single blocks).  If the file
          named already exists, it  must  be  truncated  in  order  to
          become contiguous; this command  saves  the  old contents of
          the file (in a file in  /tmp or in main memory, depending on
          size) and writes it back later.

          Contiguous files have two parameters,  min  and  max,  which
          govern allocation of disk space as the file grows.  Whenever
          a write past the end of the space currently used by the file
          is done, the file system allocates a  new extent of at least
          min blocks  and  at most max blocks (or at most large enough
          to contain the write, if that's larger).  The write fails if
          there is  no  free extent of at least min blocks.  There are
          system default values for min and max, which are used if the
          file's min and max are -1.  All allocation is inhibited when
          the max value is 0.

          OPTIONS

          extent=range   Sets  the  file's  min  and  max   allocation
                         parameters,  but does not cause any space  to
                         be  allocated.  range is a number (sets  both
                         min and max)  or  a pair of numbers separated
                         by '-' (e.g.  extent=1024-4096  sets  min  to
                         1024 and  max to 4096). The number(s) can end
                         in 'b' or 'k' to  indicate  units  of  blocks
                         (times 512) and kilobytes (times 1024).  This
                         multiplier will be  applied  to  both min and
                         max  even  if  only   max  specifies  it,  so
                         extent=10-20b   means   extent=10b-20b.    If
                         extent= is omitted, min  and  max  are set to
                         -1, which causes the system default values to
                         be used.

          alloc=rangelist
                         Allocates  one or more extents of disk  space
                         to the file.  rangelist is a range (as above)
                         or a list  of  ranges separated by ',' (e.g.,
                         alloc=10b,10b causes the file to have two ten
                         block extents).  If alloc= is omitted, contig
                         tries  to  get a single extent  to  hold  the
                         current file contents (if  any)  by  assuming
                         alloc=512-filesize.

          -freeze        Marks  the  file  to  prevent   any   further
                         allocation,  by  setting  extent  to  0.  Any
                         attempt   to  write  past  the  end  of   the
                         allocated  space   will   fail.    Since  the
                         filesize may indicate the file ends  anywhere
                         in the last  extent,  it is still possible to
                         extend  the file up to the allocation  limit.
                         The contig command always unfreezes files  if
                         -freeze   is  not  specified.   -freeze   and
                         extent= are mutually exclusive.

          -truncate      Discards any current file contents  and  disk
                         space.   Truncating  a  file  requires  write
                         access to it.

          EXAMPLES

          To turn all commands into contiguous files for speed:

               @>contig /bin/* /sys/* /usr/bin/*

          To prevent additional allocation  of  space  to a contiguous
          file:

               @>contig -freeze able

          To allocate five 1000-block extents:

               @>contig -truncate alloc=1000b,1000b,1000b,1000b,1000b able

          To  cause  subsequent  allocations to be 300 blocks,  while
          accepting allocations as small as 100 blocks if  300  free
          blocks are not available:

               @>contig extent=100-300b able

          NOTES

          -freeze does not prevent the file from being truncated,  for
          example by '>  filename'  in  the  shell,  or  the  create()
          subroutine.  If a frozen file is truncated it will be frozen
          at zero length, so no write to it can succeed.

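The growth rule described under DESCRIPTION above amounts to something
like the following.  This is my own rough sketch, not UNOS code; the
names and the default values are made up.

    #define SYS_DEFAULT_MIN  16     /* example system defaults, blocks */
    #define SYS_DEFAULT_MAX  256

    /* Size in blocks of the extent to ask the allocator for when a
     * write goes past the allocated space, or -1 if the write must
     * fail outright.  need = new blocks required; fmin/fmax are the
     * file's min and max parameters.
     */
    long extent_to_request(long need, long fmin, long fmax)
    {
        if (fmin == -1) fmin = SYS_DEFAULT_MIN;
        if (fmax == -1) fmax = SYS_DEFAULT_MAX;
        if (fmax == 0)              /* frozen: all allocation inhibited */
            return -1;
        if (need > fmax)            /* big write: cap is whatever is    */
            fmax = need;            /* needed to contain it             */
        /* The allocator then takes any free extent of fmin..fmax
         * blocks and fails if none of at least fmin blocks exists. */
        return fmax;
    }
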
Hope this is of interest....
  ___                    _               __                  _
 (   >                 _//              /  ) /       _/_    //
  __/_/> __.  ____ --- /    . . _.     /    /_  __.  /  _  // __.  o ____
 / / (__(_/|_/ / <__  /___ (_/_(__    (__/ / /_(_/|_<__</_</_(_/|_<_/ / <_
<_/

...gatech!rebel!didsgn!jlc

trt@rti.UUCP (Thomas Truscott) (12/18/87)

In article <177@cullsj.UUCP>, gupta@cullsj.UUCP (Yogesh Gupta) writes:
> ... .  However, I find that if I create a 100MB file
> under Unix (BSD 4.2, System V Rel 1), the overhead in
> randomly accessing various parts of it is too high (due to indirect
> inode structures).  Any comments?

Do you have, or would someone be willing to write, a benchmark to
demonstrate this?  For example a program that does I/O on
a 100 000 byte file vs a 100 000 000 byte file.

Sequential I/O certainly seems linear with file size.
E.g. I made a 20 000 000 byte file and a 2 000 000 byte file
on our Gould, which has a 1k/8k BSD filesystem,
and the larger file took "exactly" 10 times longer to copy
(0.925 real 0.00 user 0.40 sys  per 1 000 000 bytes.)
I think random access would be the same.
The larger file needs ~ceil((20 000 000/8192)/(8192/4)) == 2 indirect blocks,
and any reasonable I/O system will keep both in the buffer
cache at all times.
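
A quick sketch of the sort of benchmark I have in mind (untuned; the
block size and read count are arbitrary).  Run it under time(1) once on
a small file and once on a ~100MB file and compare:

    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define BLK   8192              /* read size; arbitrary     */
    #define NREAD 1000              /* number of random reads   */

    /* usage: randread file size-in-bytes */
    int main(int argc, char **argv)
    {
        char buf[BLK];
        long size, off;
        int fd, i;

        if (argc < 3)
            return 1;
        size = atol(argv[2]);
        if (size < BLK || (fd = open(argv[1], O_RDONLY)) < 0)
            return 1;
        srandom(getpid());
        for (i = 0; i < NREAD; i++) {
            off = (random() % (size / BLK)) * BLK;  /* random block */
            lseek(fd, off, SEEK_SET);
            read(fd, buf, BLK);
        }
        return 0;
    }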

Rather than making files contiguous I would prefer larger blocksizes.
For example blocksize == tracksize lets a smart disk controller
eliminate rotational delay.
	Tom Truscott

ron@topaz.rutgers.edu (Ron Natalie) (12/19/87)

First, there is no such thing as an indirect inode.  Second, contiguous
files are only good if you are the only one using the drive and you are
guaranteed to be reading the file in one single pass.  Far better
are the recent Berkeley file systems with decent interleaving and clustering.

-Ron

jpdres10@usl-pc.UUCP (Green Eric Lee) (12/20/87)

In message <1931@rti.UUCP>, trt@rti.UUCP (Thomas Truscott) says:
>In article <177@cullsj.UUCP>, gupta@cullsj.UUCP (Yogesh Gupta) writes:
>> ... .  However, I find that if I create a 100MB file
>> under Unix (BSD 4.2, System V Rel 1), the overhead in
>> randomly accessing various parts of it is too high (due to indirect
>> inode structures).  Any comments?
>Sequential I/O certainly seems linear with file size.

Yes, because of block caching knarfing the indirect blocks.

>I think random access would be the same.

Not necessarily.  You may need to re-read indirect blocks, or other
things of that sort.

>The larger file needs ~ceil((20 000 000/8192)/(8192/4)) == 2 indirect blocks,
>and any reasonable I/O system will keep both in the buffer
>cache at all times.

BSD, perhaps. I'm at home, so I don't have access to the BSD FFS
documentation.  But, let's see what Sys V does (quick, flip to back of
Bach Book):
  Well, most Sys V systems that I've seen have 1K logical blocks
(physical blocks may be larger).  A 100MB file would thus have 100,000
logical blocks. Single indirect would take us up to 266 blocks. Double
indirect would take us up to 65802 blocks. We'd be well into the
triple-indirect. Now, for sequential access, it doesn't matter,
because those triple-indirect blocks will be stored in the block
cache. But for random access... consider that we have 390 blocks full
of indexes to those 100,000 blocks, not even counting the indirects...
and considering that accessing those 100,000 blocks is likely to
displace the earliest-accessed of those 390 blocks of indexes from the
block cache (because the data blocks will be cached, of course -- the
disk algorithms don't know the difference between blocks full of block
numbers, and blocks full of data), thus eliminating any caching
effects... you may average 2-3 disk hits in order to access a block of
data, if you access the data randomly enough (note that 4 disk hits is
worst-case, first-time you hit a tail block... but it's quite likely
that the first of those indirect blocks, at least, is going to stick
around in cache for awhile). In other words, a large scale random
access program quite possibly will run at half speed. I'd like it,
too, if someone benchmarked this (we don't happen to have any file
systems handy with 100MB free, or I'd do it myself :-).
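
For the curious, the arithmetic above in sketch form.  This just mirrors
the classic Sys V layout assumed above (1K blocks, 10 direct pointers,
256 four-byte block numbers per indirect block) and counts the index
reads a completely cold cache would need:

    /* Index-block reads needed to reach logical block bno of a file,
     * assuming a cold cache, 1K blocks, 10 direct pointers, and 256
     * block numbers per indirect block (classic Sys V layout).
     */
    int index_reads(long bno)
    {
        if (bno < 10)                       /* direct                  */
            return 0;
        if (bno < 10 + 256)                 /* single indirect: up to  */
            return 1;                       /* block 265 (266 blocks)  */
        if (bno < 10 + 256 + 256L * 256)    /* double indirect: up to  */
            return 2;                       /* 65802 blocks            */
        return 3;                           /* triple indirect         */
    }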

>Rather than making files contiguous I would prefer larger blocksizes.
>For example blocksize == tracksize lets a smart disk controller
>eliminate rotational delay.

That works only when you have a file system like the BSD FFS which can
allocate fragments of blocks. Otherwise, most of the space is wasted.
Also note that it makes the job of the fragment allocator much more...
fragmentary.  Contiguous files can greatly speed things up, in
conjunction with things such as track buffers.  For example, allocating
the icons contiguously can more than double the speed of pulling up
icons under AmigaDOS/Intuition.  Unfortunately, contiguous allocation
greatly increases the complexity of the file system, much more than
BSD's fragment system (since there's a fixed number of fixed-size
fragments in a 4k-8k block).  I'd have to think about it for quite
a while to come up with a contiguous allocation scheme that worked,
that was transparent to applications, and, most importantly, had
adequate speed on currently-existing machines. History is filled with
machines that died because of elegant algorithms that sucked up more
CPU time than they were worth (an example of such an algorithm is the
Multics process scheduling algorithm, which took a dozen factors
including phase-of-moon into account when trying to decide what
process to run next ;-).

--
Eric Green  elg@usl.CSNET       P.O. Box 92191, Lafayette, LA 70509
{ihnp4,cbosgd}!killer!elg,      {ut-sally,killer}!usl!elg
  "what exactly is a dream... and what exactly is a joke?"  -- Syd Barrett

atbowler@orchid.waterloo.edu (Alan T. Bowler [SDG]) (01/05/88)

Extent based file systems can be very nice.  If the file is contiguous,
file access, random or sequential, is very fast.  The in-memory
representation is very compact.  It is easy to change a program to use
larger buffers, and this will almost certainly have a positive effect
on program and system performance.  The problems occur when you need to
grow an existing file.  A lot of operating systems have gotten screwed
up at this point and given extent based systems an undeserved bad name.
Thus we have people in this group making incorrect statements
about extent based systems in general, and giving examples from some
badly implemented systems.
So far I have seen the following arguments:
  1) You can't grow the file
     This is not true.  You simply allocate another block and add it
     to the file description.
  2) You can only grow the file "n" times.
     No, you simply allocate an extension to the file description
     and put the new extent descriptor there (a rough sketch of such
     a description follows after this list).
So far, if you just follow the above rules, you will degenerate into
a fixed block scheme like the one Unix systems have classically used.
Simply call the file descriptor extensions indirect blocks and you
have essentially the same thing.
  3) You have to specify an initial and maximum size when creating the
     file.
     No, but you can allow the user to do such a thing and be rewarded
     with a performance improvement.
  4) The available space gets fragmented and you can't allocate a
     file because there is not a big enough chunk available.
     No, you of course try to allocate new files contiguously,
     but if you can't, you simply return the space as a set of chunks.
     The logical to physical mapping is done by the system so
     except for performance the user program can't tell.
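
To make points 1 and 2 concrete, here is a rough sketch of such a
growable file description.  This is my own illustration, not any
particular system's on-disk format:

    #define NEXTENT 8                  /* extents held in-line */

    struct extent {
        long e_start;                  /* first disk block of the extent */
        long e_len;                    /* length in blocks               */
    };

    struct fdesc {
        long          f_size;          /* file size in bytes             */
        struct extent f_ext[NEXTENT];  /* in-line extent table           */
        long          f_next;          /* disk block of an extension
                                          block holding further extents,
                                          or 0 if there is none          */
    };

Shrink the extents to one block each and the extension blocks become
exactly the indirect blocks mentioned above.
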
The real thing is that an extent based system requires a reasonably
regular file system maintenance procedure to "compact" multi-extent
files into a single extent, so the user doesn't have to worry about
this.  One approach is simply to dump and restore the whole file
system once every few months.  This is a reasonably adequate approach
in most cases, but an even better approach involves running a daemon
or other automatically scheduled program that once a day, or once
a week, simply walks through the system, compacts multi-extent files,
and coalesces the available space so that large holes are available.
This program does not have to do a "perfect" job.  If a file is
currently busy, it can simply skip it and go on.  It will probably
get it the next time.
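
Such a pass might look roughly like this.  Every name below is
hypothetical; the externs are there only to make the sketch complete.

    struct fdesc;                      /* file description, as above */
    extern struct fdesc *first_file(void), *next_file(struct fdesc *);
    extern int  nextents(struct fdesc *), is_busy(struct fdesc *);
    extern void rewrite_contiguously(struct fdesc *), coalesce_free_space(void);

    void compact_pass(void)
    {
        struct fdesc *fp;

        for (fp = first_file(); fp != 0; fp = next_file(fp)) {
            if (nextents(fp) <= 1)     /* already a single extent   */
                continue;
            if (is_busy(fp))           /* busy: catch it next pass  */
                continue;
            rewrite_contiguously(fp);  /* copy into one free extent */
        }
        coalesce_free_space();         /* merge adjacent free holes */
    }
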
   There are real advantages to extent based systems, but like
everything else they do not come for free.  In this case there
is a lot of work required to do a "complete" implementation job.
Incomplete extent based systems seem to work for a while, but
a user feels like he has hit a brick wall when he runs into
a restriction imposed by an incomplete implementation.

fmayhar@killer.UUCP (Frank Mayhar) (01/09/88)

In article <12210@orchid.waterloo.edu> atbowler@orchid.waterloo.edu (Alan T. Bowler [SDG]) writes:
>The real thing is that an extent based system requires a reasonably
>regular file system maintenance procedure to "compact" multi-extent
>files into a single extent, so the user doesn't have to worry about
>this.  One approach is simply to dump and restore the whole file
>system once every few months.  [...]
>   There are real advantages to extent based systems, but like
>everything else they do not come for free.  In this case there
>is a lot of work required to do a "complete" implementation job.
>Incomplete extent based systems seem to work for a while, but
>a user feels like he has hit a brick wall when he runs into
>a restriction imposed by an incomplete implementation.

Just out of curiosity (I "own" the file management part of an OS that uses
an extent-based file system), exactly what do you mean by a "complete" 
implementation?

The OS I work with (and help maintain) is the Honeywell (Bull) CP-6 Operating
System.  Your remarks (quoted above) regarding file fragmentation are right on
the money in our case.  It's even worse because the only way we have to
effectively de-frag our disks is to do a full backup/restore, since the level of
abstraction we provide at the user level prevents a program (that's not
actually part of the OS itself) from working with the disk directly.
Not to mention the fact that having a program moving extents around
would play hell with our disk I/O caching scheme.  (Nothing like having a file
directory page in the middle of a FORTRAN source file. :-)
-- 
Frank Mayhar        UUCP: ..!{ihnp4,dj3b1}!killer!fmayhar
                    ARPA: Frank-Mayhar%ladc@BCO-MULTICS.ARPA
                    USmail: 680 Grand Ave #201 Long Beach, CA 90814
                    DeathStarNet: (213) 438-7899

mac3n@babbage.acc.virginia.edu (Alex Colvin) (01/10/88)

See the latest issue of ACM Transactions on Computer Systems (TOCS).