[comp.arch] Large Files

hamrick@convex.com (Ed Hamrick) (05/28/91)

We recently ported the UniTree file migration package to the CONVEX
platform, and ran across some NFS issues.  UniTree supports files > 2 GBytes
in length, but NFS is limited to files of 2^31-1 bytes (32 bit int field).

We've modified UniTree to allow access to large files (> 2 GBytes) using NFS
by always reporting a file length less than 2 GBytes, even when the file is
larger than this.  This allows access to the first 2 GBytes of large files,
as well as allowing users to use standard file manipulation commands such as
ls -l, mv, rm, etc.

Given that more and more operating systems are going to be able to handle
files greater than 2 GBytes (Large File Aware, LFA), allowing NFS to use
files larger than 2 GBytes will be increasingly important.  I've heard that
the next revision of the NFS specification may handle large files, but also
that it is a significant modification to the existing spec, and that it has
been promised "Real Soon Now" for quite a while.  I also have questions
about its interoperability with existing NFS implementations.

I had a thought the other day that a very minor modification to the existing
NFS specification could allow:

	1) LFA NFS clients to talk to LFA NFS servers
	2) non-LFA NFS clients to talk to LFA NFS servers
	3) LFA NFS clients to talk to non-LFA NFS servers
	4) non-LFA NFS clients to talk to non-LFA NFS servers

There are only three places in NFS that limit files to 2 GBytes:

	1) Byte offset into file for read/write (32 bit int)
	2) Length of file reported in file status response (32 bit int)
	3) Length of file to set file length to (32 bit int)

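For reference, the relevant NFS version 2 fields look roughly like this
(paraphrased from memory of the spec, not the literal XDR definitions, so
treat the details as approximate):

	/* Sketch of the NFS v2 structures involved -- not exact. */
	typedef struct { char data[32]; } fhandle;   /* opaque file handle */

	struct readargs {
	    fhandle  file;
	    unsigned offset;        /* byte offset into the file -- limit 1 */
	    unsigned count;
	    unsigned totalcount;
	};

	struct fattr {
	    /* ... type, mode, nlink, uid, gid ... */
	    unsigned size;          /* file length in bytes      -- limit 2 */
	    unsigned blocksize;     /* preferred block size */
	    unsigned blocks;        /* number of blocks allocated */
	    /* ... rdev, fsid, fileid, atime, mtime, ctime ... */
	};

	struct sattr {
	    /* ... mode, uid, gid ... */
	    unsigned size;          /* length to set the file to -- limit 3 */
	    /* ... atime, mtime ... */
	};
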
I believe that the following usage of these three fields will allow any
combination of LFA and non-LFA NFS clients and servers to interoperate;
a rough C sketch of the scheme follows the list.  These modifications
shouldn't require more than a few dozen lines of code.

	1) Most NFS clients read data on block boundaries.  If the high
	   bit of the byte offset into the file is set (i.e. the int is
	   negative), treat the lower 31 bits as a block offset into the
	   file.  A client should always send a plain byte offset for
	   all accesses below 2 GBytes to ensure interoperability with
	   non-LFA servers.  An LFA client trying to write past the
	   2 GByte mark to a non-LFA server will get an error because
	   the byte offset will appear negative to the non-LFA server.

	2) If the length of a file fits into 31 bits, the server should
	   report it directly.  If the file length exceeds 2 GBytes, the
	   byte offset within the last block (the lowest 9 bits of the
	   length, for a 512 byte block size) should be reported in the
	   lower bits of the file length field, the remaining upper bits
	   of the file length should be set to all ones (with the sign
	   bit zero), and the rest of the length information should be
	   reported in the field used for the number of blocks in the
	   file.  If this is done, an LFA client can reconstruct the
	   true size of the file as follows:

		a) If the top bits of the reported length aren't all ones,
		   then the length field contains the file's true size.

		b) Otherwise, compute the file length from the number of
		   blocks in the file and the lowest 9 bits of the
		   reported file size.

	3) The most difficult part of using the existing NFS protocol to
	   handle LFA clients and servers is to allow LFA clients to
	   set the length (truncate) of an existing file.  I suggest that
	   this be a two-part process:

		a) The LFA client computes the new file size in bytes and
		   the new file size in blocks (rounded down to the
		   nearest block).

		b) If the file size in bytes is > 2 GBytes, the client
		   fills the upper bits of the byte size, except for the
		   sign bit, with ones (keeping the lowest 9 bits, for a
		   512 byte block size).

		c) The client sets the sign bit of the file size in blocks.

		d) The client sends the message to set the file size,
		   using the byte size calculated in a) and b).

		e) When an LFA NFS server receives a file size with all
		   ones in the upper bits (except for the sign bit), it
		   uses the lowest 9 bits as the lowest 9 bits of the new
		   file size.  If the upper bits aren't all ones, it uses
		   the file size as given.

		f) If the file size is > 2 GBytes, the client then sends a
		   second message to set the file size, using the number
		   of blocks computed in a) and c).

		g) When an LFA NFS server receives a file size with the
		   sign bit set, it uses the low 31 bits as the number of
		   blocks (rounded down) in the file, keeping the lowest
		   9 bits of the file size set in step e).

		h) When a non-LFA NFS server receives a file size with the
		   sign bit set, it should treat this as an error and make
		   no change to the file.

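To make the scheme above concrete, here is a rough sketch in C of what an
LFA client might do for each of the three cases.  This is illustrative
only: the names (lfa_encode_offset, nfs_setattr_size, and so on) are made
up, a 512 byte block size is assumed, and "unsigned" is assumed to be the
32 bit wire quantity.

	#define NFS_BLKSHIFT 9               /* log2(512), assumed block size */
	#define SIGN_BIT     0x80000000u     /* high bit of a 32 bit field */
	#define LOW9_MASK    0x000001FFu     /* byte offset within a block */
	#define HIGH_ONES    0x7FFFFE00u     /* upper bits set, sign bit clear */

	/* Hypothetical routine that sends a SETATTR with the given size. */
	extern void nfs_setattr_size(unsigned size);

	/* (1) Encode a 64 bit byte offset into the 32 bit offset field.
	 * Offsets below 2 GBytes go out unchanged so non-LFA servers
	 * still work; larger offsets must be block aligned and are sent
	 * as a block number with the high bit set. */
	unsigned lfa_encode_offset(unsigned long long byte_off)
	{
	    if (byte_off < 0x80000000ull)
	        return (unsigned) byte_off;
	    return SIGN_BIT | (unsigned)(byte_off >> NFS_BLKSHIFT);
	}

	/* (2) Reconstruct the true length from the fattr "size" and
	 * "blocks" fields returned by an LFA server. */
	unsigned long long lfa_true_size(unsigned size, unsigned blocks)
	{
	    if ((size & HIGH_ONES) != HIGH_ONES)
	        return size;                 /* length field is the true size */
	    return ((unsigned long long) blocks << NFS_BLKSHIFT)
	           | (size & LOW9_MASK);     /* blocks plus low 9 bits */
	}

	/* (3) Set the length of a (possibly large) file with at most two
	 * SETATTR messages, per steps a) through f) above. */
	void lfa_set_size(unsigned long long new_size)
	{
	    if (new_size < 0x80000000ull) {
	        nfs_setattr_size((unsigned) new_size);
	        return;
	    }
	    /* first message: low 9 bits of the size, upper bits all ones */
	    nfs_setattr_size(HIGH_ONES | (unsigned)(new_size & LOW9_MASK));
	    /* second message: block count (rounded down), sign bit set */
	    nfs_setattr_size(SIGN_BIT | (unsigned)(new_size >> NFS_BLKSHIFT));
	}

The server side is just the mirror image of this, which is where the "few
dozen lines" estimate comes from.
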
I've had the best experience using an NFS block size of 512 bytes, since
some NFS clients assume this (the du utility, for instance).  Ideally an
NFS client should use whatever block size the server reports.

I'd welcome any feedback, ideas, or criticisms anybody might have.  I'd
also welcome any other ideas for using NFS in an LFA client/server environment.

I'd appreciate any information that anybody might have about what operating
systems have file systems that currently support files > 2 GBytes, and what
operating systems have announced support in the near future for large files.

I know that Cray UNICOS currently supports large files, I think Amdahl UTS
has support (but I'm not positive), and ConvexOS will have it Real Soon Now.
Please send e-mail and I'll consolidate and summarize.

This is cross-posted to comp.arch to solicit information about file systems
that support large files.  Please follow-up to comp.protocols.nfs.

Regards,
Ed Hamrick
CONVEX Computer Corporation
(602) 468-7977
hamrick@convex.com

paulo@soprano.chorus.fr (Paulo Amaral) (05/28/91)

In article <1991May27.170212.18590@convex.com>, hamrick@convex.com (Ed Hamrick) writes:
%% We recently ported the UniTree file migration package to the CONVEX
%% platform, and ran across some NFS issues.  UniTree supports files > 2 GBytes
%% in length, but NFS is limited to files of 2^31-1 bytes (32 bit int field).
%% 
%% We've modified UniTree to allow access to large files (> 2 GBytes) using NFS
%% by always reporting a file length less than 2 GBytes, even when the file is
%% larger than this.  This allows access to the first 2 GBytes of large files,
%% as well as allowing users to use standard file manipulation commands such as
%% ls -l, mv, rm, etc.
%% 
...
%% 
%% I've had the best experiences using NFS block sizes of 512 bytes, as some
%% NFS clients assume this (du utility for instance).  An NFS client should use
%% whatever block size the server reports.
%% 

For a 2Gb file, a 512 byte buffer would mean 4M read operations (RPC
requests).  Assuming 2 ms for each RPC, it would take more than 2 hours to
read a whole file, whereas with 8kb buffers it would take 8 minutes.
--
        ______
       /     /                          
      /_____/_           ___         Paulo Amaral
     /    /__/ /  / /   /  /         Chorus Systemes 
    /    /  / /__/ /__ /__/          6 avenue Gustave Eiffel 
                                     F-78182, St-Quentin-en-Yvelines Cedex  
Tel: +33 (1) 30 64 82 35               Fax: + 33 (1) 30 57 00 66
E-mail: paulo@chorus.fr              OR paulo%chorus.fr@mcsun.EU.net

hamrick@convex.com (Ed Hamrick) (05/29/91)

In article <10845@chorus.fr> paulo@soprano.chorus.fr (Paulo Amaral) writes:
>For a 2Gb file, a 512 byte buffer would mean 4M read operations (RPC requests).
>Assuming 2 ms for each RPC it would take more than 2 hours to read a whole file
>whereas with 8kb buffers it would take 8 minutes.

You are quite correct that all NFS reads and writes should be in units of
8 KBytes if possible in order to achieve the best possible performance.
(Another interesting area of discussion is how to increase this to ~64 KBytes
without breaking very much.  HIPPI NFS performance would benefit from this.)

I was referring to the block size reported in the fattr field blocksize.
Things seem to work best when this is set to 512 bytes.  I seem to recall that
this was done to make the "du" utility work properly since it assumes
(in at least one implementation) that the fattr field "blocks" contains the
number of 512 byte blocks allocated to a file.
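
The assumption I'm thinking of amounts to something like this (a paraphrase
for illustration, not the actual du source):

	/* A du-style client takes the fattr "blocks" field to be in
	 * 512 byte units no matter what the fattr "blocksize" field
	 * says, so it computes the space used as blocks * 512 bytes.
	 * (Paraphrased, not actual source.) */
	unsigned long du_bytes_used(unsigned long blocks, unsigned long blocksize)
	{
	    (void) blocksize;          /* ignored by this kind of client */
	    return blocks * 512UL;     /* assumes 512 byte blocks */
	}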

Regards,
Ed Hamrick

tbray@watsol.waterloo.edu (Tim Bray) (05/29/91)

hamrick@convex.com (Ed Hamrick) writes:
 We recently ported the UniTree file migration package to the CONVEX
 platform, and ran across some NFS issues.  UniTree supports files > 2 GBytes
 in length...

How many unixes out there *really* support files > 2Gb?  From the application
programmer's point of view, it seems that all one need do is change off_t
in <sys/types.h> to be a signed > 32-bit quantity.  

I would assume Cray & Convex, anyone else?  A future R/4000 OS?

cheers, Tim Bray, Open Text Systems

PS: I wonder what proportion of applications can tolerate 
    (sizeof(int) != sizeof(off_t))?
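
    For the curious, the usual failure mode looks something like this
    (a hypothetical fragment, not from any particular program):

	#include <sys/types.h>
	#include <unistd.h>

	/* If off_t is wider than int, the offset returned by lseek() is
	 * silently truncated in the assignment below, so anything past
	 * 2 GBytes comes back wrong (or negative). */
	int current_offset(int fd)
	{
	    int pos = lseek(fd, (off_t) 0, SEEK_CUR);  /* should be off_t pos */
	    return pos;
	}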

sef@kithrup.COM (Sean Eric Fagan) (05/30/91)

In article <1991May29.130418.26097@watdragon.waterloo.edu> tbray@watsol.waterloo.edu (Tim Bray) writes:
>PS: I wonder what proportion of applications can tolerate 
>    (sizeof(int) != sizeof(off_t))?

A fair number of them.  16-bit Intel machines have sizeof(int) <
sizeof(off_t).  Generally.

-- 
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.

rmilner@zia.aoc.nrao.edu (Ruth Milner) (05/31/91)

In article <1991May28.175809.13532@convex.com> hamrick@convex.com (Ed Hamrick) writes:
>(Another interesting area of discussion is how to increase this to ~64 KBytes
>without breaking very much.  HIPPI NFS performance would benefit from this.)

[All of my comments below apply only to applications requiring large amounts
of data, on the scale discussed in this thread.  An application which can only
use a few K at a time of whatever is sent to it will probably not benefit from
large buffer sizes.  Perhaps this could be tunable on a per-mount basis?  It
already is to some extent, with rsize and wsize.]

Yes! Yes! This has three benefits: 1) it would reduce the number of client
requests by almost an order of magnitude, 2) throughput on any high-speed
network will benefit from larger buffer sizes (Ultranet is another), and
3) by asking the disk for 64K instead of 8K, you can take advantage of
physical contiguity on the disk as well as decent buffering in the drive
itself.  While most current disks do not have as many as 128 sectors on a
single track, track-to-track seeks are very fast, and if the filesystem is
sensible it will minimize latency on this type of seek.  Many of the modern
disks (e.g. SCSI) have enough internal buffer to hold 2 tracks' worth of
data, and will send that out at a burst rate - much higher than the
direct-from-disk transfer rate - when the bus is free.

Also, since this would reduce the number of requests the server has to
respond to, it would require less of the server's attention and improve
overall response time for all clients.
-- 
Ruth Milner
Systems Manager                     NRAO/VLA                  Socorro NM
Computing Division Head      rmilner@zia.aoc.nrao.edu