hamrick@convex.com (Ed Hamrick) (05/28/91)
We recently ported the UniTree file migration package to the CONVEX
platform, and ran across some NFS issues. UniTree supports files > 2 GBytes
in length, but NFS is limited to files of 2^31-1 bytes (32 bit int field).

We've modified UniTree to allow access to large files (> 2 GBytes) using NFS
by always reporting a file length less than 2 GBytes, even when the file is
larger than this. This allows access to the first 2 GBytes of large files,
as well as allowing users to use standard file manipulation commands such as
ls -l, mv, rm, etc.

Given that there are going to be more and more operating systems which are
able to handle files greater than 2 GBytes (Large File Aware, LFA), allowing
NFS to use files larger than 2 GBytes will be increasingly important.

I've heard that the next revision of the NFS specification may handle large
files, but I've also heard that it is a significant modification to the
existing NFS spec. I've also heard for a long time that it will be out
"Real Soon Now". I also have questions about interoperability with existing
NFS implementations.

I had a thought the other day that a very minor modification to the existing
NFS specification could allow:

   1) LFA NFS clients to talk to LFA NFS servers
   2) non-LFA NFS clients to talk to LFA NFS servers
   3) LFA NFS clients to talk to non-LFA NFS servers
   4) non-LFA NFS clients to talk to non-LFA NFS servers

There are only three places in NFS that limit files to 2 GBytes:

   1) Byte offset into file for read/write (32 bit int)
   2) Length of file reported in file status response (32 bit int)
   3) Length of file to set file length to (32 bit int)

I believe that the following usage of these three fields will allow any
combination of LFA and non-LFA NFS client/servers to interoperate. These
modifications don't require more than a few dozen lines of code.

1) Most NFS clients read data on block boundaries. If the high bit of the
   byte offset into file is set (if the int is negative) then treat the
   lower 31 bits as a block offset into the file. A client should always
   send a byte offset for all file reads below 2 GBytes to assure
   interoperability with non-LFA servers. An LFA client trying to write
   past the 2 GByte mark to a non-LFA server will get an error because the
   byte offset will appear negative to the non-LFA server.

2) If the length of a file fits into 31 bits, the server should report it.
   If the file length exceeds 2 GBytes, the block offset (lower 9 bits for
   a 512 byte block size) should be reported in the lower bits of the file
   length, and all ones in the upper bits of the file length (the sign bit
   should be zero). The remaining information about the file length should
   be reported in the field used for reporting the number of blocks in the
   file. If this is done, a LFA client can reconstruct the true size of the
   file by:

   a) If the top bits of the reported length aren't all ones, then the
      length field contains the file's true size
   b) Otherwise, compute the file length from the number of blocks in the
      file and the lowest N (9) bits of the reported file size

3) The most difficult part of using the existing NFS protocol to handle LFA
   clients and servers is to allow LFA clients to set the length (truncate)
   of an existing file. I suggest that this be a two-part process:

   a) The LFA client computes a file size in bytes and a file size in
      blocks (rounded down to the nearest block).
   b) If the file size in bytes is > 2 GBytes, fill in the top N bits
      except for the sign bit with ones.
   c) Set the sign bit of the number of blocks
   d) Send the message to set the file size, using the number of bytes
      calculated in a) and b).
   e) When an LFA NFS server receives a file size with all ones in the top
      N bits (except for the sign bit), use the low M bits (9 for 512 byte
      block size) for the low M bits of the file size. If the top N bits
      aren't one, use the file size as indicated.
   f) If the file size is > 2 GBytes, send the message to set the file
      size, using the number of blocks computed in a) and c).
   g) When an LFA NFS server receives a file size with the sign bit set,
      use the low 31 bits as the number of blocks (rounded down) in the
      file. Keep the existing low N bits of the file size in the new file
      size.
   h) When a non-LFA NFS server receives a file size with the sign bit set,
      it should be treated as an error and a no-op.

I've had the best experiences using NFS block sizes of 512 bytes, as some
NFS clients assume this (du utility for instance). An NFS client should use
whatever block size the server reports.

I'd welcome any feedback, ideas, or criticisms anybody might have. I'd also
welcome any other ideas for using NFS in an LFA client/server environment.

I'd appreciate any information that anybody might have about what operating
systems have file systems that currently support files > 2 GBytes, and what
operating systems have announced support in the near future for large files.
I know that Cray UNICOS currently supports large files, I think Amdahl UTS
has support (but I'm not positive), and ConvexOS will have it Real Soon Now.

Please send e-mail and I'll consolidate and summarize.

This is cross-posted to comp.arch to solicit information about file systems
that support large files. Please follow-up to comp.protocols.nfs.

Regards,
Ed Hamrick
CONVEX Computer Corporation
(602) 468-7977
hamrick@convex.com
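The three field encodings proposed in the post above can be sketched as follows. This is only an illustration under stated assumptions, not code from UniTree or any NFS implementation: every function and constant name is hypothetical, the block size is fixed at 512 bytes, and any length at or above the all-ones marker value (0x7FFFFE00) is sent in the marker form so that a marker can never be mistaken for a true size (the post itself draws the line at 2 GBytes).

```python
BLOCK = 512                      # block size assumed throughout the post
SIGN = 0x80000000                # high (sign) bit of a 32-bit field
LOW31 = 0x7FFFFFFF               # the remaining 31 bits
MARKER = LOW31 & ~(BLOCK - 1)    # sign bit 0, upper bits all ones, low 9 bits 0

def encode_offset(byte_offset):
    """LFA client: 32-bit offset field for a read or write (item 1)."""
    if byte_offset < SIGN:
        return byte_offset       # plain byte offset, understood by any server
    if byte_offset % BLOCK or byte_offset // BLOCK > LOW31:
        raise ValueError("large offsets must be block aligned and fit 31 bits")
    return SIGN | (byte_offset // BLOCK)

def decode_offset(field):
    """LFA server: recover the byte offset from the 32-bit field."""
    return (field & LOW31) * BLOCK if field & SIGN else field

def encode_length(length):
    """LFA server: (size, blocks) fields for a file status reply (item 2).

    Simplification: 'blocks' is whole 512-byte blocks of length; for small
    files a real server would report allocated blocks here instead.
    """
    blocks = length // BLOCK
    if length < MARKER:          # assumption: marker range itself is reserved
        return length, blocks
    return MARKER | (length % BLOCK), blocks

def decode_length(size, blocks):
    """LFA client: reconstruct the true length from the two reply fields."""
    if size & MARKER == MARKER:  # top bits (below the sign bit) all ones
        return blocks * BLOCK + (size & (BLOCK - 1))
    return size

def encode_truncate(length):
    """LFA client: size values for the one or two set-size messages (item 3)."""
    if length < MARKER:
        return [length]
    return [MARKER | (length % BLOCK), SIGN | (length // BLOCK)]

def apply_truncate(current_length, field):
    """LFA server: new file length after one set-size message (steps e, g)."""
    if field & SIGN:             # block count; keep existing low 9 bits
        return (field & LOW31) * BLOCK + current_length % BLOCK
    if field & MARKER == MARKER: # marker form; replace only the low 9 bits
        return current_length - current_length % BLOCK + field % BLOCK
    return field                 # ordinary byte length
```

For example, a 5 GByte + 123 byte file would be reported with size 0x7FFFFE7B and 10485760 blocks, and setting a file to that length takes two messages: the first fixes the low 9 bits, the second the block count. A non-LFA server sees the second value as negative and, per step h), rejects it.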
paulo@soprano.chorus.fr (Paulo Amaral) (05/28/91)
In article <1991May27.170212.18590@convex.com>, hamrick@convex.com (Ed Hamrick) writes:
%% We recently ported the UniTree file migration package to the CONVEX
%% platform, and ran across some NFS issues. UniTree supports files > 2 GBytes
%% in length, but NFS is limited to files of 2^31-1 bytes (32 bit int field).
%%
%% We've modified UniTree to allow access to large files (> 2 GBytes) using NFS
%% by always reporting a file length less than 2 GBytes, even when the file is
%% larger than this. This allows access to the first 2 GBytes of large files,
%% as well as allowing users to use standard file manipulation commands such as
%% ls -l, mv, rm, etc.
%%
...
%%
%% I've had the best experiences using NFS block sizes of 512 bytes, as some
%% NFS clients assume this (du utility for instance). An NFS client should use
%% whatever block size the server reports.
%%
For a 2Gb file, a 512 byte buffer would mean 4M read operations (RPC requests). Assuming 2 ms for each RPC it would take more than 2 hours to read a whole file whereas with 8kb buffers it would take 8 minutes.
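The arithmetic behind these figures can be checked with a short sketch. It assumes "2Gb" means 2^31 bytes, that "8kb" means 8192 bytes, and that every RPC costs a flat 2 ms regardless of payload size; the function name is made up for illustration.

```python
def rpc_read_cost(file_bytes, buffer_bytes, rpc_seconds=0.002):
    """Return (number of read RPCs, total seconds) to read a whole file."""
    rpcs = -(-file_bytes // buffer_bytes)   # ceiling division
    return rpcs, rpcs * rpc_seconds

FILE_BYTES = 2**31                          # assumed meaning of "2Gb"

for buf in (512, 8192):
    rpcs, secs = rpc_read_cost(FILE_BYTES, buf)
    print(f"{buf:5d}-byte buffer: {rpcs} RPCs, {secs / 60:.1f} minutes")
```

Under these assumptions the 512-byte case needs 4194304 RPCs (about 2.3 hours) and the 8192-byte case needs 262144 RPCs (about 8.7 minutes).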
--
______
/ /
/_____/_ ___ Paulo Amaral
/ /__/ / / / / / Chorus Systemes
/ / / /__/ /__ /__/ 6 avenue Gustave Eiffel
F-78182, St-Quentin-en-Yvelines Cedex
Tel: +33 (1) 30 64 82 35 Fax: + 33 (1) 30 57 00 66
E-mail: paulo@chorus.fr OR paulo%chorus.fr@mcsun.EU.net
hamrick@convex.com (Ed Hamrick) (05/29/91)
In article <10845@chorus.fr> paulo@soprano.chorus.fr (Paulo Amaral) writes:
>For a 2Gb file, a 512 byte buffer would mean 4M read operations (RPC requests).
>Assuming 2 ms for each RPC it would take more than 2 hours to read a whole file
>whereas with 8kb buffers it would take 8 minutes.

You are quite correct that all NFS reads and writes should be in units of
8 KBytes if possible in order to achieve the best possible performance.
(Another interesting area of discussion is how to increase this to ~64 KBytes
without breaking very much. HIPPI NFS performance would benefit from this.)

I was referring to the block size reported in the fattr field blocksize.
Things seem to work best when this is set to 512 bytes. I seem to recall
that this was done to make the "du" utility work properly since it assumes
(in at least one implementation) that the fattr field "blocks" contains the
number of 512 byte blocks allocated to a file.

Regards,
Ed Hamrick
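The "du" issue described in the post above can be illustrated with a small sketch. It assumes, as the post recalls for at least one implementation, that du multiplies the fattr "blocks" value by 512 no matter what block size the server reports; the function names and the 1 MByte example are hypothetical.

```python
DU_UNIT = 512   # bytes per block that du is assumed to hard-code

def blocks_field(allocated_bytes, unit):
    """Value a server would report in 'blocks' if it counted unit-byte blocks."""
    return -(-allocated_bytes // unit)      # round up to whole blocks

def du_kbytes(blocks):
    """Usage in KBytes as printed by a du that assumes 512-byte blocks."""
    return blocks * DU_UNIT // 1024

allocated = 1024 * 1024                     # a file occupying 1 MByte on disk
print(du_kbytes(blocks_field(allocated, 512)))    # server counts 512-byte blocks
print(du_kbytes(blocks_field(allocated, 8192)))   # server counts 8 KByte blocks
```

With 512-byte units du reports the expected 1024 KBytes; if the server counted in 8 KByte units, the same du would report only 64 KBytes, a factor of 16 too low.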
rmilner@zia.aoc.nrao.edu (Ruth Milner) (05/31/91)
In article <1991May28.175809.13532@convex.com> hamrick@convex.com (Ed Hamrick) writes:
>(Another interesting area of discussion is how to increase this to ~64 KBytes
>without breaking very much. HIPPI NFS performance would benefit from this.)

[All of my comments below apply only to applications requiring large amounts
of data, on the scale of this thread. Any application which primarily can
only use a few K of whatever is sent to it will probably not benefit from
using large buffer sizes. Perhaps this could be tunable on a per-mount basis?
It already is to some extent, with rsize and wsize.]

Yes! Yes! This has three benefits:

1) it would reduce the number of client requests by almost an order of
   magnitude,
2) throughput on any high-speed network will benefit from larger buffer
   sizes (Ultranet is another), and
3) by asking the disk for 64K instead of 8K, you can take advantage of
   physical contiguity on the disk as well as decent buffering in the drive
   itself.

While most current disks do not have as many as 128 sectors on a single
track, track-to-track seeks are very fast, and if the filesystem is sensible
it will minimize latency on this type of seek. Many of the modern disks
(e.g. SCSI) have enough internal buffer to hold 2 tracks' worth of data, and
will send that out at a burst rate - much higher than the direct-from-disk
transfer rate - when the bus is free.

Also, since this would reduce the number of requests the server had to
respond to, it would require less of the server's attention and improve
response time overall to all clients.
--
Ruth Milner
Systems Manager                 NRAO/VLA                 Socorro NM
Computing Division Head         rmilner@zia.aoc.nrao.edu