[comp.unix.questions] Sparse Files ?

mark@ria-emh2.army.mil (Mark D. McKamey IM SA) (04/27/89)

Hello,
 
     I have just finished compiling fsanalyze, version 4.2 written by 
Mr. Michael Young.  One of the output report fields says: 

	Sparse files      = 0 (0.00%)

     What is the definition of a "Sparse file" in the UNIX world?


--
Mark D. McKamey  -  mark@RIA-EMH2.ARMY.MIL

madd@bu-cs.BU.EDU (Jim Frost) (05/01/89)

In article <19342@adm.BRL.MIL> mark@ria-emh2.army.mil (Mark D. McKamey IM SA) writes:
|     What is the definition of a "Sparse file" in the UNIX world?

UNIX stores data in files by maintaining pointers to data blocks.  By
allocating only those blocks which have actually been written to, you
can create files which appear to be larger than they actually are.
These are usually created by lseek()ing and write()ing.

When you create an empty file, the system allocates a file information
block (called an inode) which contains a small list of block pointers.
This list is initially blank.  When we write into the file, the system
gets data blocks and sets the appropriate block pointer to point to
the block.

When we just create a file we get something like this:

	ptr1 -> null
	ptr2 -> null
	ptr3 -> null

When we write to that file we get something like this:

	ptr1 -> block1
	ptr2 -> null
	ptr3 -> null

We deposit data into block1 until block1 is filled, then get another
block and set ptr2 to point to it.  If instead of just opening and
writing the file you open, seek into the file somewhere, and then
write, you can get something like:

	ptr1 -> null
	ptr2 -> block1
	ptr3 -> null

To the user it looks like he has a two-block file which has one block
of zeros (the system returns zeros for reads into null blocks), but to
the system he has only a one-block file.  This difference can add up
to a considerable savings in some cases.  For the normal case, this
behavior affects nothing.
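
A minimal C sketch of the lseek()-then-write() approach described above.
The function name make_sparse and its parameters are made up for
illustration; only open(), lseek(), write(), fstat() and close() are real
system calls.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create `path`, skip `hole` bytes without writing
 * anything, then write `len` bytes from `data`.
 * Returns 0 on success and fills in *st; returns -1 on any error.
 */
int make_sparse(const char *path, off_t hole, const char *data, size_t len,
                struct stat *st)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* Seeking past end-of-file allocates nothing; the gap is the "hole". */
    if (lseek(fd, hole, SEEK_SET) == (off_t)-1 ||
        write(fd, data, len) != (ssize_t)len ||
        fstat(fd, st) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

After make_sparse("demo.dat", 1048576, "x", 1, &st), st.st_size reports
the full 1048577 bytes.  How many blocks are really allocated depends on
the file system; one that does not permit holes has to fill the gap with
real blocks of zeros.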

jim frost
madd@bu-it.bu.edu

scott@rdahp.UUCP (Scott Hammond) (05/02/89)

In article <30481@bu-cs.BU.EDU> madd@bu-it.bu.edu (Jim Frost) writes:
>In article <19342@adm.BRL.MIL> mark@ria-emh2.army.mil (Mark D. McKamey IM SA) writes:
>|     What is the definition of a "Sparse file" in the UNIX world?
>
>[discussion on how it works]

I'm interested in knowing how much UNIX _application_ software (besides
news, mailers, or pathalias) uses sparse files.  In particular, given an
underlying file system implementation which doesn't permit holes, are
there many situations where a lot of space is going to be wasted by
traditional attempts at creating sparse files? 
--
Scott Hammond,  R & D Associates, Marina del Rey, CA  (213) 822-1715
: {ksuvax1,zardoz,randvax}!rdahp!scott
:  scott@harris.cis.ksu.edu

dg@lakart.UUCP (David Goodenough) (05/02/89)

madd@bu-cs.BU.EDU (Jim Frost) sez:
] mark@ria-emh2.army.mil (Mark D. McKamey IM SA) writes:
] |     What is the definition of a "Sparse file" in the UNIX world?
] 
] UNIX stores data in files by maintaining pointers to data blocks.  By
] allocating only those blocks which have actually been written to, you
] can create files which appear to be larger than they actually are.
] These are usually created by lseek()ing and write()ing.
] 
] When you create an empty file, the system allocates a file information
] block (called an inode) which contains a small list of block pointers.
] This list is initially blank.  When we write into the file, the system
] gets data blocks and sets the appropriate block pointer to point to
] the block.

Hummm - this sounds a bit like CP/M - obviously UNIX stuffs a lot more info
into the inode than CP/M does into its directory slot, but the method of
a list of block numbers is exactly the same. Now for the $64000 question:
what does UNIX do when it runs out of block number slots in the inode? I
doubt it's the same as CP/M (which just allocates a second directory entry
for the file, and sets a flag to show this is an extension). So how does
UNIX handle very big files?
-- 
	dg@lakart.UUCP - David Goodenough		+---+
						IHS	| +-+-+
	....... !harvard!xait!lakart!dg			+-+-+ |
AKA:	dg%lakart.uucp@xait.xerox.com		  	  +---+

guy@auspex.auspex.com (Guy Harris) (05/04/89)

>Hummm - this sounds a bit like CP/M - obviously UNIX stuffs a lot more info
>into the inode than CP/M does into its directory slot, but the method of
>a list of block numbers is exactly the same. Now for the $64000 question:
>what does UNIX do when it runs out of block number slots in the inode? I
>doubt it's the same as CP/M (which just allocates a second directory entry
>for the file, and sets a flag to show this is an extension). So how does
>UNIX handle very big files?

In the most common UNIX file systems, namely the V7 one (used by 4.1BSD,
System III, and System V, as well as V7 itself, and derivatives of the
aforementioned systems) and the 4.2BSD one, the first N (N == 10 for V7, N
== 12 for 4.2BSD) slots are "direct" pointers; they contain the block
number of a block in the file (the zeroth one contains the block number
of the zeroth block, etc.).  The N+1st one is an "indirect" pointer; it
contains the block number of a block full of block numbers in the file,
starting with the N+1st block of the file.  The N+2nd one is a
"doubly-indirect" pointer; it contains the block number of a block full
of block numbers of indirect blocks, and the N+3rd one is a
"triply-indirect" pointer, containing just what you think it would.

A block in the V7 file system is usually either 512 or 1024 bytes, and
a block number is 4 bytes; a block in the 4.2BSD file system is
typically 4K or 8K.  Some versions of both file systems can have even
bigger blocks.  Thus, if it runs out of block number slots in the inode,
given that the last slot maps a block full of blocks that map blocks
full of..., your file has gotten pretty big; the V7 and 4.2BSD file
systems just say "enough is enough" and don't let your file get any
bigger.
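
The addressing scheme above can be sketched in C roughly as follows.
This is an illustration, not code from any kernel: the function names are
made up, and it assumes a block number is the same size inside indirect
blocks as the 4 bytes quoted for V7.

```c
/*
 * How many levels of indirection are needed to reach logical block
 * `lbn` of a file: 0 = direct, 1 = indirect, 2 = doubly-indirect,
 * 3 = triply-indirect, -1 = beyond what the inode can map.
 * `ndirect` is N from the text; `nindir` is the number of block
 * numbers that fit in one block (block size / block number size).
 */
int indirection_level(unsigned long long lbn, unsigned ndirect,
                      unsigned long long nindir)
{
    unsigned long long span = nindir;  /* blocks reachable at this level */
    int level;

    if (lbn < ndirect)
        return 0;
    lbn -= ndirect;

    for (level = 1; level <= 3; level++) {
        if (lbn < span)
            return level;
        lbn -= span;     /* skip everything mapped by this level */
        span *= nindir;  /* each extra level multiplies the reach */
    }
    return -1;
}

/*
 * Upper bound, in bytes, on what the block map alone can address:
 * N direct blocks plus one indirect, one doubly-indirect and one
 * triply-indirect tree.  `ptr_size` must be nonzero.
 */
unsigned long long max_mappable_bytes(unsigned ndirect,
                                      unsigned long long block_size,
                                      unsigned long long ptr_size)
{
    unsigned long long nindir = block_size / ptr_size;
    unsigned long long blocks = ndirect + nindir + nindir * nindir
                              + nindir * nindir * nindir;
    return blocks * block_size;
}
```

With the V7 numbers (N == 10, 512-byte blocks, 4-byte block numbers),
max_mappable_bytes(10, 512, 4) comes to 1082201088 bytes, a little over
1 GB.  Other limits, such as the width of the file offset, may well be
reached before the block map runs out.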

madd@bu-cs.BU.EDU (Jim Frost) (05/08/89)

In article <228@rdahp.UUCP> scott@rdahp.UUCP (Scott Hammond) writes:
|I'm interested in knowing how much UNIX _application_ software (besides
|news, mailers, or pathalias) uses sparse files.  In particular, given an
|underlying file system implementation which doesn't permit holes, are
|there many situations where a lot of space is going to be wasted by
|traditional attempts at creating sparse files? 

Not generally, although some common database techniques can create
sparse files (dbm and ndbm packages use techniques which commonly make
sparse files, and dbm/ndbm are often used by applications because
they're already there).  Unless you're dealing with a large database
application, it's unlikely to be a problem, and many databases avoid
sparse files.

In answer to your question, there is not very much application
software which uses sparse files.  Be careful, though -- it only takes
one such application to make your life difficult.
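
One rough way to spot such files is to compare a file's logical length
with the space actually allocated to it.  The sketch below is only a
heuristic and not necessarily how fsanalyze counts sparse files; the name
looks_sparse is made up, and it assumes stat() supplies an st_blocks
field counted in 512-byte units, which holds on 4.2BSD-style systems but
not everywhere.

```c
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: returns 1 if the file at `path` occupies fewer
 * bytes on disk than its logical length, 0 if not, -1 if stat() fails.
 * ASSUMPTION: st_blocks is in 512-byte units.
 */
int looks_sparse(const char *path)
{
    struct stat st;

    if (stat(path, &st) < 0)
        return -1;
    return (long long)st.st_blocks * 512 < (long long)st.st_size;
}
```

A file system that stores small files or compressed data in unusual ways
can make an ordinary file look sparse here, so treat a result of 1 as a
hint rather than proof.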

jim frost
madd@bu-it.bu.edu