golds@fjc.GOV (Rich Goldschmidt) (05/28/91)
I posted a request here for info about two weeks ago with no response, so I'll try again. The original query was about how file server performance is measured. What do claims of "X users supported" really mean, and how are those measurements made? And how do NFS and Novell compare on the same hardware, like a 386 or 486 running either Novell 3.11 or Interactive with NFS? There must be people out there who know answers to at least some of these questions!

And just to broaden the scope, what are people's experiences out there with Interactive's NFS? I have observed that the documentation has errors and is incomplete, and the authentication doesn't work the way it ought to. Of greater concern, I am seeing slower NFS file service performance than I had expected, and getting what I think are a lot of NFS errors.

--
Rich Goldschmidt: uunet!fjcp60!golds or golds@fjc.gov
Commercialization of space is the best way to escape the zero-sum economy.
Disclaimer: I don't speak for the government, and it doesn't speak for me...
kdenning@genesis.Naitc.Com (Karl Denninger) (05/28/91)
In article <427@fjcp60.GOV> golds@fjc.GOV (Rich Goldschmidt) writes:
>hardware, like a 386 or 486 running either Novell 3.11 or Interactive with
>NFS. There must be people out there who know answers to at least some of
>these questions!

Don't even bother comparing these. They aren't comparable.

Compare Novell on a 386 to a REAL machine, like a MIPS Magnum. These are about the same price point once you buy the Novell software and the 386 machine. With a REAL server, NFS to a PC running B&W's package is faster than Novell, both under single-user and multiuser loads.

>And just to broaden the scope, what are peoples experiences out there with
>Interactive's NFS. I have observed that the documentation has errors and is
>incomplete, and the authentication doesn't work the way it ought to. What is
>of greater concern is I am seeing slower NFS file service performance than I
>had expected, and getting what I think are a lot of NFS errors.

ISC's NFS is horrid. The best I've seen on writes is about 30k/sec on a fast '386 machine. The CPU is not saturated, nor is the Ethernet card. The problem appears to be in either the Streams modules or the TCP and NFS code itself.

ISC's NFS is also an older revision, and doesn't support root-remapping on a mount-point basis. Blech.

--
Karl Denninger - AC Nielsen, Bannockburn IL (708) 317-3285
kdenning@nis.naitc.com
"The most dangerous command on any computer is the carriage return."
Disclaimer: The opinions here are solely mine and may or may not reflect those of the company.
rbraun@spdcc.COM (Rich Braun) (05/29/91)
kdenning@genesis.Naitc.Com (Karl Denninger) writes:
>ISC's NFS is horrid. The best I've seen on writes is about 30k/sec on a
>fast '386 machine. The CPU is not saturated, nor is the Ethernet card.

SOSS is almost that fast, running on an 8-bit card with a 20-MHz 386 and talking to a Novell server through a second network hop, using 1K RPC writes. ISC can't be *that* bad, can it?

-rich
alex@mks.com (Alex White) (05/30/91)
In article <7678@spdcc.SPDCC.COM> rbraun@spdcc.COM (Rich Braun) writes:
>kdenning@genesis.Naitc.Com (Karl Denninger) writes:
>>ISC's NFS is horrid. The best I've seen on writes is about 30k/sec on a
>>fast '386 machine. The CPU is not saturated, nor is the Ethernet card.
>
>SOSS is almost that fast, running on an 8-bit card with a 20-MHz 386
>and talking to a Novell server through a second network hop, using 1K
>RPC writes. ISC can't be *that* bad, can it?
>
>-rich

Can't be *that* bad? Sure can!

However, I've found that the CPU is indeed saturated. That's on a 33 MHz 386 with a 16-bit Ethernet card. Mind you, I was also trying to figure out why backing up over the network (rsh ISCmachine 'find | cpio' >/dev/rmt0) was so slow, and after breaking it all down, found that even the find by itself, with no network traffic, used a good healthy hunk of the CPU. Personally, I think ISC just has a horridly slow nami().

Does anybody know if any of the fancy dandy Ethernet cards with TCP/IP on them have any kind of driver that would work? And would they end up faster? [If the problem is in Streams or something, maybe; but if it's in nami(), not a hope!]
kdenning@genesis.Naitc.Com (Karl Denninger) (05/30/91)
In article <7678@spdcc.SPDCC.COM> rbraun@spdcc.COM (Rich Braun) writes:
>kdenning@genesis.Naitc.Com (Karl Denninger) writes:
>>ISC's NFS is horrid. The best I've seen on writes is about 30k/sec on a
>>fast '386 machine. The CPU is not saturated, nor is the Ethernet card.
>
>SOSS is almost that fast, running on an 8-bit card with a 20-MHz 386
>and talking to a Novell server through a second network hop, using 1K
>RPC writes. ISC can't be *that* bad, can it?

It will also hang the TCP stack entirely if you mount an ISC disk from a Sun (or other real workstation) with 8k read/write block sizes and blast a few hundred K of data back and forth. If you run 1k blocks, it works. VERY slowly. PCs can't load it heavily enough to crash it, but they run real slow.

Yes, ISC's NFS is really that bad (this is 2.2, with or without their "update" installed). I'd love a REAL NFS implementation on top of a Real Unix for the 386. It doesn't exist to the best of my knowledge.

--
Karl Denninger - AC Nielsen, Bannockburn IL (708) 317-3285
kdenning@nis.naitc.com
"The most dangerous command on any computer is the carriage return."
Disclaimer: The opinions here are solely mine and may or may not reflect those of the company.
geoff@hinode.East.Sun.COM (Geoff Arnold @ Sun BOS - R.H. coast near the top) (05/30/91)
Quoth kdenning@genesis.Naitc.Com (Karl Denninger) (in <1991May30.045522.6246@Firewall.Nielsen.Com>):
#I'd love a REAL NFS implementation on top of a Real Unix for the 386. It
#doesn't exist to the best of my knowledge.

Not even SVR4...? Maybe the folks on comp.unix.sysv386 have some comparative numbers for ISC and their favourite brand of SVR4.

--
Geoff Arnold, PC-NFS architect, Sun Microsystems. (geoff@East.Sun.COM)
-- Sun Microsystems PC Distributed Systems ...
-- ... soon to be a part of SunTech (stay tuned for details)
larryp@sco.COM (Larry Philps) (05/30/91)
In <1991May30.021412.22925@mks.com> alex@mks.com (Alex White) writes:
> Does anybody know if any of the fancy dandy ethernet cards with tcp/ip
> on them have any kind of driver that would work?

There are a few out there. The entire Excelan line (now being sold by Federal Technologies), Interlan, and probably a couple more.

> And would they end up faster?

Typically this is a slower solution than host-based protocols. Why? Because in order to keep the price down, the on-board processor, the one that runs your TCP/IP stack, is slow: typically a 6-12 MHz 80186. This is a lot slower than a 25+ MHz 80486/R3000/SPARC/RS6000/PA-RISC/... when it comes to fiddling bits and computing a checksum!

In general, there is no reason why this architecture cannot give you equal or better performance than a host-based scheme, but you have to put a *real* CPU on the network board. If you find one, expect to pay big bucks.

---
Larry Philps, SCO Canada, Inc.
Postman: 130 Bloor St. West, 10th floor, Toronto, Ontario. M5S 1N5
InterNet: larryp@sco.COM or larryp%scocan@uunet.uu.net
UUCP: {uunet,utcsri,sco}!scocan!larryp
Phone: (416) 922-1937
mikef@leland.Stanford.EDU (Michael Fallavollita) (05/30/91)
I believe Intel's Unix Rel. 4 has a full implementation of NFS.

--
Mike Fallavollita
fallavol@corvus.arc.nasa.gov [128.102.24.98]
mikef@leland.stanford.edu [36.21.0.69]
ian@unipalm.uucp (Ian Phillipps) (05/30/91)
rbraun@spdcc.COM (Rich Braun) writes:
>kdenning@genesis.Naitc.Com (Karl Denninger) writes:
>>ISC's NFS is horrid. The best I've seen on writes is about 30k/sec on a
>>fast '386 machine. The CPU is not saturated, nor is the Ethernet card.
>
>SOSS is almost that fast, running on an 8-bit card with a 20-MHz 386
>and talking to a Novell server through a second network hop, using 1K
>RPC writes. ISC can't be *that* bad, can it?

No: I just copied /unix from an ISC 386 to a Sun 3/50 (yeah!) in 6 seconds elapsed; that's approx 120k/second.

Ian
brian@telebit.com (Brian Lloyd) (05/31/91)
There exists an nfsstone benchmark program. I am not sure where it came from, but I will look into it. I suspect that it came from comp.sources.* and can probably be found on uunet.

--
Brian Lloyd, WB6RQN            Telebit Corporation
Network Systems Architect      1315 Chesapeake Terrace
brian@napa.telebit.com         Sunnyvale, CA 94089-1100
voice (408) 745-3103           FAX (408) 734-3333
kdenning@genesis.Naitc.Com (Karl Denninger) (05/31/91)
In article <1991May30.165457.26093@unipalm.uucp> ian@unipalm.uucp (Ian Phillipps) writes:
>rbraun@spdcc.COM (Rich Braun) writes:
>
>>kdenning@genesis.Naitc.Com (Karl Denninger) writes:
>>>ISC's NFS is horrid. The best I've seen on writes is about 30k/sec on a
>>>fast '386 machine. The CPU is not saturated, nor is the Ethernet card.
>
>>SOSS is almost that fast, running on an 8-bit card with a 20-MHz 386
>>and talking to a Novell server through a second network hop, using 1K
>>RPC writes. ISC can't be *that* bad, can it?
>
>No: I just copied /unix from an ISC 386 to a Sun 3/50 (yeah!) in 6 seconds
>elapsed; thats approx 120k/second.

Note the direction here. You were READING from the ISC machine.

Now try to copy it the other direction. You'll either (1) hang TCP/IP, if you have the default buffer size, or (2) get horrible throughput if you're using the 1k block size. The Sun will swamp the ISC machine's protocol stack, and blow it sky high with 8k (actually, anything more than 1k) block size.

--
Karl Denninger - AC Nielsen, Bannockburn IL (708) 317-3285
kdenning@nis.naitc.com
"The most dangerous command on any computer is the carriage return."
Disclaimer: The opinions here are solely mine and may or may not reflect those of the company.
ian@unipalm.uucp (Ian Phillipps) (06/01/91)
kdenning@genesis.Naitc.Com (Karl Denninger) writes:
>In article <1991May30.165457.26093@unipalm.uucp> ian@unipalm.uucp (Ian Phillipps) writes:
>>No: I just copied /unix from an ISC 386 to a Sun 3/50 (yeah!) in 6 seconds
>>elapsed; thats approx 120k/second.
>
>Note the direction here. You were READING from the ISC machine.
>
>Now try to copy it the other direction. You'll either (1) hang TCP/IP,
>if you have the default buffer size, or (2) get horrible throughput if
>you're using the 1k block size.

Sorry - not reading carefully enough. I just tried it - 25k/second, using the NFS server on the ISC. Not too sure of the block-size settings, but it tallies with the performance given above.

>The Sun will swamp the ISC machine's protocol stack, and blow it sky high
>with 8k (actually, anything more than 1k) block size.

We have a very slow Sun...
niklas@appli.se (Niklas Hallqvist) (06/02/91)
kdenning@genesis.Naitc.Com (Karl Denninger) writes:
>ISC's NFS is horrid. The best I've seen on writes is about 30k/sec on a
>fast '386 machine. The CPU is not saturated, nor is the Ethernet card. The
>problem appears to be in either the Streams modules or the TCP and NFS code
>itself.
>
>ISC's NFS is also an older revision, and doesn't support root-remapping on a
>mount-point basis.
>
>Blech.

These figures shocked me and I didn't believe them at first, so I had to check. Truly enough, I got only 25k/sec on transfers from a SCO Unix to an ISC 2.0.2. The other way around it was 200k/sec; that's an order of magnitude involved here!

What startles me is that I think Lachman has written the NFS implementation for both SCO and ISC, but that might be wrong. If it is not, I think ISC should get in touch with Lachman and talk about an upgrade of ISC's NFS. If that will not happen, our company seriously has to reconsider the choice of ISC over SCO, as I'm sure others will too, now that this fact has come to the net's attention.

This problem would've caused me lots of performance problems very soon if I had not seen this message; thanks, Karl. Yeah, I know I should've done some benchmarking before deciding the OS on our soon-to-be backup server, but laziness is one of my great virtues (at least Larry Wall thinks that's a virtue...).

Niklas
--
Niklas Hallqvist        Phone: +46-(0)31-40 75 00
Applitron Datasystem    Fax: +46-(0)31-83 39 50
Molndalsvagen 95        Email: niklas@appli.se
S-412 63 GOTEBORG, Sweden    mcsun!sunic!chalmers!appli!niklas
r_hockey@fennel.cc.uwa.oz.au (06/12/91)
Has anyone seen this happen? We have a PC-NFS network with about 50 PCs served by a Sun 3/160. When a user (on a 386-25) indexes a large file with dBase III+, with both the dbase and index file on the same mounted drive, the system comes to a standstill. When we check the system, we find all 8 nfsd's have completely taken over the whole system.

Richard Hockey
Public Health UDM
University of Western Australia
NEDLANDS WA 6009
lm@slovax.Eng.Sun.COM (Larry McVoy) (06/13/91)
niklas@appli.se (Niklas Hallqvist) writes:
> Truly enough, I got only 25k/sec on transfers from a SCO Unix to an ISC 2.0.2.
> The other way around it was 200k/sec; that's an order of magnitude involved here!

I suspect you are running into the following NFSism: all writes to an NFS server are turned into sync writes on the server (like you opened with O_SYNC). This is very slow; large transfers can be 3x or more slower on writes than reads.

The reason for this has to do with NFS' stateless nature - it can't ACK the write until the data is safe; otherwise the server could crash and the client would lose data.

> What startles me is that I think Lachman has written the NFS implementation
> for both SCO and ISC, but that might be wrong.

Lachman didn't write either - they ported them.

---
Larry McVoy, Sun Microsystems     (415) 336-7627
...!sun!lm or lm@sun.com
kdenning@genesis.Naitc.Com (Karl Denninger) (06/13/91)
In article <623@appserv.Eng.Sun.COM> lm@slovax.Eng.Sun.COM (Larry McVoy) writes:
>niklas@appli.se (Niklas Hallqvist) writes:
>> Truely enough, I got only 25k/sec on transfer from a SCO Unix to an ISC 2.0.2.
>> The other way around it was 200k/sec, that's an order of magnitude involved here!
>
>I suspect you are running into the following NFSism: all writes to an NFS
>server are turned into sync writes on the server (like you opened with
>O_SYNC). This is very slow, large transfers can be 3x or more slower on
>writes than reads.
>
>The reason for this has to do with NFS' stateless nature - it can't ACK
>the write until the data is safe; otherwise the server could crash and the
>client would lose data.

The interesting thing is, there is little or no disk activity going on (from a look at the wait I/O times and queues)... on a Sun, on the other hand, there IS a lot of disk activity during an NFS write operation. The MIPS systems I've used don't suffer from this problem.

I don't quite understand the fanaticism with which people preach the NFS stateless nature, O_SYNC and all that. The fact is that a crash of a LOCAL Unix machine with the normal block-buffering scheme can easily cause the loss of data -- in this case, the write(2) call returned "ok" but it really might not be "OK"! This is true whether the problem is later found to be a bad disk sector, the machine panicking, or any one of a number of other causes. Normal disk I/O on Unix machines is NOT reliable enough to say "if you get a good return from write(), the data is safely on disk". If you WANT reliable I/O, you open with O_SYNC and take the performance hit.

Why wasn't this option designed into NFS? It could have been set up so that non-Unix clients (which expect reliable I/O and don't have a "buffer cache" that can be disabled on a per-file basis) default to O_SYNC mode... this is easily handled by making the Unix open() hook set the "no sync" flag...
Or was this a short-cut that has just never been repaired?

--
Karl Denninger - AC Nielsen, Bannockburn IL (708) 317-3285
kdenning@nis.naitc.com
"The most dangerous command on any computer is the carriage return."
Disclaimer: The opinions here are solely mine and may or may not reflect those of the company.
lm@slovax.Eng.Sun.COM (Larry McVoy) (06/14/91)
kdenning@genesis.Naitc.Com (Karl Denninger) writes:
> >The reason for this has to do with NFS' stateless nature - it can't ACK
> >the write until the data is safe; otherwise the server could crash and the
> >client would lose data.
>
> The interesting thing is, there is little or no disk activity going on (from
> a look at the wait I/O times and queues)..... on a Sun, on the other hand,
> there IS a lot of disk activity during an NFS write operation.
>
> The MIPS systems I've used don't suffer from this problem.
>
> I don't quite understand the fanatacism with which people preach the NFS
> stateless nature, O_SYNC and all that. The fact is that a crash of a
> LOCAL Unix machine with the normal block buffering scheme can easily cause
> the loss of data -- in this case, the write(2) call returned "ok" but it
> really might not be "OK"! Normal disk I/O on Unix machines is NOT reliable
> enough to say "if you get a good return from write(), the data is safely
> on disk".

NFS is stateless. The reason for this statelessness is so that a client does not need to do anything special when a server goes down. A dead server looks just like a slow server to a client.

A client issues a write, the server ACKs the write. What does that ACK mean? It means that the client data is safe. The client kernel may throw away the data; the server has promised that the data can be retrieved.

If the server ACKs the data before writing it to disk, there is a window during which the server can crash. The data is then lost.

MIPS systems have an unsafe export option that allows you to turn off this constraint - big performance win, big safety lose.

There are other ways to address this problem without breaking the semantics of NFS. One such way is to buffer the writes in NVRAM.

---
Larry McVoy, Sun Microsystems     (415) 336-7627
...!sun!lm or lm@sun.com
kdenning@genesis.Naitc.Com (Karl Denninger) (06/14/91)
In article <625@appserv.Eng.Sun.COM> lm@slovax.Eng.Sun.COM (Larry McVoy) writes:
>kdenning@genesis.Naitc.Com (Karl Denninger) writes:
>> I don't quite understand the fanatacism with which people preach the NFS
>> stateless nature, O_SYNC and all that. The fact is that a crash of a
>> LOCAL Unix machine with the normal block buffering scheme can easily cause
>> the loss of data -- in this case, the write(2) call returned "ok" but it
>> really might not be "OK"! This is true whether the problem is later found
>> to be a bad disk sector, the machine panicing, or any one of a number of
>> other causes. Normal disk I/O on Unix machines is NOT reliable enough to
>> say "if you get a good return from write(), the data is safely on disk".
>
>NFS is stateless. The reason for this statelessness is so that a client
>does not need to do anything special when a server goes down. A dead
>server looks just like a slow server to a client.

So far, so good.

>A client issues a write, the server ACKs the write. What does that ACK
>mean? It means that the client data is safe. The client kernel may
>throw away the data, the server has promised that the data can be
>retrieved.
>
>If the server ACKs the data before writing it to disk, there is a window
>during which the server can crash. The data is then lost.

How does this differ from the standard "Unix" way of doing file I/O, which returns a successful reply from a write call before the data is safely on disk? If you write data and get back an "ACK" (or good return value), the data isn't necessarily on disk -- it could be in the buffer cache. If the machine crashes before the data is flushed, you lose.

I can't see how this is any different from ACKing packets from NFS clients when you haven't actually written them any further than the buffer cache (exactly the same as the standard Unix semantics). You have the same risks if the server (the machine with the disk on it :-) crashes as you would with a local workstation or server drive. In both cases data can be lost.

>MIPS systems have an unsafe export option that allows you to turn off
>this constraint - big performance win, big safety lose.

There is no export option in the manual pages for RiscOS 4.51 which addresses what you're talking about. I just checked again; it's not there.

>There are other ways to address this problem without breaking the
>semantics of NFS. One such way is to buffer the writes in NVRAM.

Like Legato's PrestoServe. Yes, I know. That is not completely safe either. You could have "something happen" to the Presto board -- and your data would be lost.

The point is that standard Unix machines often say "your data is safe" when it really isn't. In fact, ALL systems, by virtue of the fact that hardware can fail, make this assumption. I don't see what you buy by having the default for NFS transactions be more "safe" than a local disk drive -- other than making recovery from crashes simple for the client side.

I would think that one of the easiest ways to address this would be to allow an option to have "safe" or "unsafe" writes on a per-mount basis. This allows the user to choose his level of performance and risk, and make his/her own choice. I'd be for that.

--
Karl Denninger - AC Nielsen, Bannockburn IL (708) 317-3285
kdenning@nis.naitc.com
"The most dangerous command on any computer is the carriage return."
Disclaimer: The opinions here are solely mine and may or may not reflect those of the company.
k2@bl.physik.tu-muenchen.de (Klaus Steinberger) (06/14/91)
kdenning@genesis.Naitc.Com (Karl Denninger) writes:
>>MIPS systems have an unsafe export option that allows you to turn off
>>this constraint - big performance win, big safety lose.
>
>There is no export option in the manual pages for RiscOS 4.51 which
>addresses what you're talking about. I just checked again; it's not there.

It's not in the exports file, it's in fstab. Here is the relevant part of the manpage (FSTAB(SysV)):

     nfs_sync|nfs_async
          Typically, NFS performs synchronous writes; however,
          performing asynchronous writes can speed up performance
          tremendously. This option allows NFS servers to control
          their local file system behavior for NFS write requests.
          Synchronous writes guarantee the data has been written to
          disk rather than guaranteeing that a server has correctly
          received a write request. This only becomes an issue if a
          server crashes during a write. See kopt(8) to change the
          global default value. nfs_async is the current default.

As you can see, the default for the server is ASYNC. You can change the behaviour on a per-filesystem basis at mount time, or globally at system startup with kopt. We use the default of async, and we are happy.

Sincerely,
Klaus Steinberger
--
Klaus Steinberger      Beschleunigerlabor der TU und LMU Muenchen
Phone: (+49 89)3209 4287    Hochschulgelaende
FAX: (+49 89)3209 4280      D-8046 Garching, Germany
BITNET: K2@DGABLG5P         Internet: k2@bl.physik.tu-muenchen.de
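For concreteness, a server-side fstab entry using this option might look like the following. The field layout and device/mount names are illustrative only; the nfs_sync/nfs_async keywords are the only part taken from the man page above.

```
# Server's /etc/fstab (layout illustrative): force synchronous
# handling of NFS write requests to the exported /usr filesystem.
/dev/usr   /usr   ffs   rw,nfs_sync   0 0
```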
thurlow@convex.com (Robert Thurlow) (06/14/91)
In <1991Jun13.164017.29944@Firewall.Nielsen.Com> kdenning@genesis.Naitc.Com (Karl Denninger) writes:
>Why wasn't [O_SYNC] designed into NFS?

Um, we picked this up from the NFSSRC 4.0 release. Pages get written to the buffer cache so they can be read next time, but your write(2) blocks until all I/O is completed on those pages. If Sun has it, and Sun licensees have it, in what sense is more needed? If your vendor doesn't do this, and does have a buffer cache and biods, just ask them for the option.

Rob T
--
Rob Thurlow, thurlow@convex.com
An employee and not a spokesman for Convex Computer Corp., Dallas, TX
droms@regulus.bucknell.edu (Ralph E. Droms) (06/14/91)
In article <1991Jun13.234448.16172@Firewall.Nielsen.Com> kdenning@genesis.Naitc.Com (Karl Denninger) writes:
>>If the server ACKs the data before writing it to disk, there is a window
>>during which the server can crash. The data is then lost.
>
>How does this differ from the standard "Unix" way of doing file I/O, which
>returns a successful reply from a write call before the data is safely on
>disk? If you write data, get back an "ACK" (or good return value), the data
>isn't necessarily on disk -- it could be in the buffer cache. If the
>machine crashes before the data is flushed, you lose.

I think the difference lies in the feedback to the user. If the local UNIX box crashes, the user is aware "something is wrong" immediately. If the server crashes and reboots, the data can be lost silently...

--
- Ralph Droms                   Computer Science Department
droms@bucknell.edu              323 Dana Engineering
Bucknell University             (717) 524-1145
Lewisburg, PA 17837
geoff@hinode.East.Sun.COM (Geoff Arnold @ Sun BOS - R.H. coast near the top) (06/14/91)
Quoth droms@bucknell.edu (in <DROMS.91Jun14092449@regulus.bucknell.edu>):
#In article <1991Jun13.234448.16172@Firewall.Nielsen.Com> kdenning@genesis.Naitc.Com (Karl Denninger) writes:
#
# >
# >If the server ACKs the data before writing it to disk, there is a window
# >during which the server can crash. The data is then lost.
#
# How does this differ from the standard "Unix" way of doing file I/O, which
# returns a successful reply from a write call before the data is safely on
# disk? If you write data, get back a "ACK" (or good return value) the data
# isn't necessarially on disk -- it could be in the buffer cache. If the
# machine crashes before the data is flushed you lose.
#
#I think the difference lies in the feedback to the user. If the local
#UNIX box crashes, the user is aware "something is wrong" immediately.
#If the server crashes and reboots, the data can be lost silently...
It's more than simply a vague "feedback to the user": it's a
question of what assertions can be made about the correctness
of file system operations. Even though normal buffer cache
operations can reorder some kinds of operation, I can code something
like
write(file1, data1)
fsync(file1)
write(file2, "file1 was written successfully")
(with appropriate error checking) and be confident that file2 will
be written if and only if file1 was written. Karl's "standard Unix way"
doesn't apply here: if the machine crashes, the process will crash
with it. If an NFS server could ack the first write (but not
commit it to stable storage), then crash and reboot, the failure
of the write would be undetectable.
The decision as to whether data should be written "safely" or not should
logically rest with the client, not the server. This is why the
hack of an async server side configuration option is so dangerous.
The correct approach, of course, is the (unimplemented) RFS_WRITECACHE
NFS function.... >sigh< But for now, Prestoserve is the best solution.
--Geoff Arnold, PC-NFS architect(geoff@East.Sun.COM or geoff.arnold@Sun.COM)--
------------------------------------------------------------------------------
-- Sun Technology Enterprises : PC Networking group --
-- (officially from July 1, but effectively in place right now) --
thurlow@convex.com (Robert Thurlow) (06/14/91)
In <1991Jun13.234448.16172@Firewall.Nielsen.Com> kdenning@genesis.Naitc.Com (Karl Denninger) writes:
>I can't see how this is any different than ACKing packets from NFS clients
>when you haven't actually written them any further than the buffer cache
>(exactly the same as the standard Unix semantics). You have the same risks
>if the server (the machine with the disk on it :-) crashes as you would with
>a local workstation or server drive. In both cases data can be lost.

No, you _don't_ have the same risks; you have a lot more points of failure, like someone turning off your server and physically removing it from your network, for example. With NFS, you've taken a pretty reliable disk I/O subsystem and put the disk maybe very far away, with lots of failure points you didn't have before, and with other processes able to alter the data outside of either your control or awareness. To some degree, though, you're still expecting it to obey perfect Unix filesystem semantics. It just ain't gonna work that way (though if Sun fixed some of the protocol bugs in NFS, it'd be better). An implementor needs to think harder to get NFS to do The Right Thing.

Take this for an example: you're doing a 1K write to a filesystem with an 8K block size, so you need to do a read/edit/write of a whole block. What happens when the initial read tells you that the file has changed, and that you should flush everything you know about the file out of your buffer cache? How do you hang onto the data you were trying to write? That isn't a problem over UFS.

>>MIPS systems have an unsafe export option that allows you to turn off
>>this constraint - big performance win, big safety lose.
>
>There is no export option in the manual pages for RiscOS 4.51 which
>addresses what you're talking about. I just checked again; it's not there.

Yup, we do this too, but we make a discouraging noise about it, and it isn't the default like it is on Silicon Graphics machines. It's worth the kick for some things, though; my Sun (running 4.1.1) often doesn't survive a server crash, so an 'unsafe' swap might be okay.

>I would think that one of the easiest ways to address this would be to allow
>an option to have "safe" or "unsafe" writes on a per-mount basis. This
>allows the user to choose his level of performance and risk, and make
>his/her own choice. I'd be for that.

It would make a good mount option, I agree. Having a global decision about such things made for you sucks.

Rob T
--
Rob Thurlow, thurlow@convex.com
An employee and not a spokesman for Convex Computer Corp., Dallas, TX
pcg@aber.ac.uk (Piercarlo Grandi) (06/15/91)
On 12 Jun 91 03:00:04 GMT, r_hockey@fennel.cc.uwa.oz.au said:

hockey> Has anyone seen this happen, We have a PC-NFS network with about
hockey> 50 PCs served by a SUN 3/160 when a user (using a 386-25)
hockey> indexes a large file with dbase III+ with both the dbase and
hockey> index file on the same mounted drive the system comes to
hockey> standstill. When we check the system we find all 8 nfsd's have
hockey> completely taken over the whole system.

It's a well-known phenomenon with early releases of SunOS 4: 8 nfsds thrash the MMU/cache of a Sun 3. This cache has exactly 8 slots; if only the 8 nfsds are active, everything is fine. As soon as another process starts to run, you have 9 processes scheduled round-robin, and the MMU/cache is managed LRU. Every context switch brings a cache reload, which takes about 1 ms on the machine you mention... and so on.

Just reduce the number of nfsds to 4. The problem will go away, and you don't really lose I/O performance between 8 and 4 nfsds, even if there are people who argue to the contrary. Try it. Or you could buy a Sun 4, which has many more MMU/cache contexts and runs a newer release of the OS in which the problem has been worked around a bit.

--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@aber.ac.uk
kdenning@genesis.Naitc.Com (Karl Denninger) (06/15/91)
In article <6743@eastapps.East.Sun.COM> geoff@east.sun.com (Geoff Arnold @ Sun BOS - R.H. coast near the top) writes: >Quoth droms@bucknell.edu (in <DROMS.91Jun14092449@regulus.bucknell.edu>): >#In article <1991Jun13.234448.16172@Firewall.Nielsen.Com> kdenning@genesis.Naitc.Com (Karl Denninger) writes: ># ># > ># >If the server ACKs the data before writing it to disk, there is a window ># >during which the server can crash. The data is then lost. ># ># How does this differ from the standard "Unix" way of doing file I/O, which ># returns a successful reply from a write call before the data is safely on ># disk? ..... ># >#I think the difference lies in the feedback to the user. If the local >#UNIX box crashes, the user is aware "something is wrong" immediately. >#If the server crashes and reboots, the data can be lost silently... > >It's more than simply a vague "feedback to the user": it's a >question of what assertions can be made about the correctness >of file system operations. Even though normal buffer cache >operations can reorder some kinds of operation, I can code something >like > > write(file1, data1) > fsync(file1) > write(file2, "file1 was written successfully") > >(with appropriate error checking) and be confident that file2 will >be written if and only if file1 was written. Karl's "standard Unix way" >doesn't apply here: if the machine crashes, the process will crash >with it. If an NFS server could ack the first write (but not >commit it to stable storage), then crash and reboot, the failure >of the write would be undetectable. Understood. However, the issue is data loss, not reboot-n-continue behavior or whether the process dies along with the machine. If you soft mount directories (yes, I know this is dangerous) your process will get an I/O failure if the server goes down -- indicating that you have lost >something<. Data loss is data loss -- with or without the process continuing to exist. 
I would think that the real solution here would be to have a crashed and
rebooted server return some form of error on the next I/O request (what,
I don't know offhand; perhaps ENXIO) if you are mounted async and the
server crashes and reboots. At least you'd be notified that there is a
potential data integrity problem that your software needs to investigate
or report.

>The decision as to whether data should be written "safely" or not should
>logically rest with the client, not the server. This is why the
>hack of an async server side configuration option is so dangerous.
>The correct approach, of course, is the (unimplemented) RFS_WRITECACHE
>NFS function.... >sigh< But for now, Prestoserve is the best solution.
>
>--Geoff Arnold, PC-NFS architect (geoff@East.Sun.COM or geoff.arnold@Sun.COM)--

AGREED. The decision SHOULD be with the client. I believe that many
systems would opt for the async choice, but I disagree with making it
something you don't have control over at the client level.

One other option would be to have fsync() on an NFS file return success
only if all operations since the last fsync() or open() had succeeded. A
crash is an exception condition here, since the client will not have
executed an open() prior to the fsync() -- thus, in that case fsync()
would return failure. If the client opens with O_SYNC, then you do only
sync I/O. On a close() do an implied fsync(), and again return success
only if all data "makes it".

This does require keeping one bit of state around -- whether or not an
"open" or "fsync" has been executed (a noted I/O error rates a "no" to
that question). This is very close to the semantics of a local
filesystem, and should be pretty easy to do. It also doesn't affect
anything in existing software (except that programs which don't do an
fsync() or check close() return values are at risk -- but on a local
disk in this case they would be too!)
This is what one would expect on a local disk in the event of a disk
failure -- if you didn't check close()'s return value, you might
mistakenly think your data all got there when it didn't.

Prestoserve is not a total safety net -- it's hardware, and CAN fail.
The risks there are exactly the same as a crash/disk failure/whatever.
The only real saving grace is that it doesn't fail often, having no
moving parts.
--
Karl Denninger - AC Nielsen, Bannockburn IL (708) 317-3285
kdenning@nis.naitc.com
"The most dangerous command on any computer is the carriage return."
Disclaimer: The opinions here are solely mine and may or may not reflect
those of the company.
beepy@terra.Eng.Sun.COM (Brian Pawlowski) (06/15/91)
[I lost the original article I was responding to so am replying to a followup]

In article <1991Jun13.234448.16172@Firewall.Nielsen.Com>,
kdenning@genesis.Naitc.Com (Karl Denninger) writes:

> The MIPS systems I've used don't suffer from this problem.

Then they probably suffer from other problems :-) That they don't have
an export option indicating "safe" and "unsafe" raises a question...

> I don't quite understand the fanatacism with which people preach the NFS
> stateless nature, O_SYNC and all that. The fact is that a crash of a
> LOCAL Unix machine with the normal block buffering scheme can easily cause
> the loss of data -- in this case, the write(2) call returned "ok" but it
> really might not be "OK"! This is true whether the problem is later found
> to be a bad disk sector, the machine panicing, or any one of a number of
> other causes. Normal disk I/O on Unix machines is NOT reliable enough to
> say "if you get a good return from write(), the data is safely on disk".

Good analysis for local operation. I would argue (below) that
distributed operation is a little different, particularly with regard to
assumptions and expected behaviour during failures of the nodes involved
as compared to assumptions made when a local component fails.

> If you WANT reliable I/O, you open with O_SYNC and take the performance hit.
> Why wasn't this option designed into NFS? It could have been set up so that
> for Non-Unix clients (which expect reliable I/O and don't have a "buffer
> cache" that can be disabled on a file I/O basis) default to O_SYNC mode...
> this is easily handled by making the Unix "open()" hook set the "no sync"
> flag...
>
> Or was this a short-cut that has just never been repaired?
The "problem" with a distributed file system, as typified by NFS, is
that modifying operations held in the server's buffer cache could be
lost to a crash/failure without the client *ever* being aware of it, if
the updates are made asynchronously (some time in the future) to the
persistent store (disk) following successful acknowledgement to the
client. NFS simplifies the (likely) failure modes by adding the
semantic to a modifying operation (write, create, etc.) that the
operation has been applied to persistent store.

[I believe several studies on the reliability of hosts on the Internet
point to non-catastrophic failures--SW failures:-)--as a primary cause
of crashes. This semantic for NFS addresses this nicely. While having
access to some number N of hosts increases availability of data to a
given node, it also offers many more opportunities to experience an
unexpected component failure (server crash) and thereby lose data. NFS
reduces its "critical state" assumptions, simplifies client-server
operation, and I believe increases reliability through the requirement
for flush-to-persistent storage.]

I think the analogy to the "local I/O" situation is flawed. A user gets
immediate notification of an "OS crash" in the local case because his
application crashes too. He has *little expectation* that all his data
is safe and will probably take some action to investigate the situation.

In the distributed case, things get fuzzier. Assume for a moment that a
given vendor implements buffered writes on an NFS server to increase
performance (tossing the synchronous modify semantic); you have now
introduced an interesting error class: silent data loss. The scenario
is that the server can acknowledge a final write by an application while
still holding several buffered data blocks for the client queued to
write to disk. The server returns "OK", the client application is happy
and exits. The server crashes before it is able to flush data.
Blissfully unaware, the client (and user) continue working on other
files on other servers, and return to the server in question sometime
later, after it has rebooted -- and lost data the user believes was
written to disk on the server. I contend that the user expects the data
to be on disk because *he knows* his machine has been running beyond the
synchronization time for flushing data to disk, from a client's
perspective. To find that the data is *lost* some time in the future,
without an intervening client crash, introduces an insidious error and,
I (further) contend, violates a basic transparency property provided by
NFS (that of making remote files seem like local files -- to a great
degree).

NFS does not provide "exact" local file system semantics for UNIX. The
original design paper describes the decisions made in providing
semantics and the trade-offs accepted to simplify implementation and
reduce the complexity of error recovery.

One could envision a production DFS which buffers modified data on a
server in volatile storage for increased performance. I believe most
current (research) systems which do so take a rather cavalier attitude
towards ensuring the integrity of modified data on behalf of users. I
would propose that you would want to introduce recovery mechanisms to
allow a client to resubmit data lost to a server crash -- this
introduces complex recovery scenarios to a DFS, and was left out of NFS
in the original design. [Asynchronous writes after a fashion have been
proposed for a protocol revision of NFS... Some time in the hazy
future.]

Comments on write performance for NFS: NFS is not so bad, from a
client's perspective on writes, as would be inferred from the above
discussion. A client OS *still* does read-ahead and write-behind for
application I/O when talking to an NFS server, through the use of BIODs.
The close() system call semantic was extended to include a synchronous
flush of all dirty modified pages when you close a file, which ensures
that any errors in flushing modified data to a server will be made
available to the application. [The addition of the flush-on-close
semantic to support asynchronous error return for NFS was a design
trade-off vs. *totally* synchronous writes from the application
perspective.] I believe this trade-off gets close to expected local
file behaviour and eliminates silent data loss. [For expected likely
failures -- SW crashes. Of course a hard disk crash burns everyone --
but I believe this is *much less* expected.]

This is not to say that write performance for NFS is outstanding:-) I am
a proponent of improving write performance beyond current NFS levels.
One immediate attack is to install a Presto board (Sun and DEC have
this. Others?) Hell, it will accelerate your local synchronous modifying
operations (like mkdir, etc). Another attack is to use a product like
eNFS for accelerating large file writes.

All improvements in this area (as the above solutions do) should
recognize that distribution inherently introduces different (more
interesting) failure modes, and that I for one (and I believe others)
don't appreciate an implementation of a distributed file system which
provides me with the wonderful possibility of silent loss of critical
data.

> Karl Denninger - AC Nielsen, Bannockburn IL (708) 317-3285
> kdenning@nis.naitc.com

Brian Pawlowski
last time I looked
guy@auspex.auspex.com (Guy Harris) (06/21/91)
>It's a well known phenomenon with early releases of SunOS 4. 8 nfsds
>thrash the MMU/cache of a sun 3. This cache has exactly 8 slots; if only
>the 8 nfsds are active, everything is fine; as soon as another process
>starts to run, you have 9 processes scheduled round robin, and the
>MMU/cache is managed LRU. Every context switch brings a cache reload,
>which takes about 1ms on the machine you mention... and so on.

Could you please point to the lines of source code in those early
releases that either

1) cause NFS daemons not to release their user-mode address space and
   context, or

2) cause context switches to processes without contexts (kernel
   processes such as NFS daemons) to reload the context register?

If such code exists, it must have been in a release prior to SunOS
4.0.3, as:

1) the "nfs_svc()" call in SunOS 4.0.3, which the "nfsd" program makes
   to create an NFS daemon, calls "relvm()", which completely discards
   the user-mode address space of the process (e.g., it nulls out
   "u.u_procp->p_as");

2) the context-switching code in SunOS 4.0.3 ("resume()") won't bother
   reloading the context register if it's switching to a process with no
   user-mode address space (i.e., with a null "u.u_procp->p_as").

Given that, the 8 vs. 4 has nothing whatsoever to do with the number of
contexts in 4.0.3. Maybe releases prior to that missed 1) or 2), but I
tend to doubt it.
jim@cs.strath.ac.uk (Jim Reid) (06/26/91)
In article <PCG.91Jun14194239@aberdb.aber.ac.uk> pcg@aber.ac.uk (Piercarlo Grandi) writes:
On 12 Jun 91 03:00:04 GMT, r_hockey@fennel.cc.uwa.oz.au said:
hockey> Has anyone seen this happen, We have a PC-NFS network with about
hockey> 50 PCs served by a SUN 3/160 when a user (using a 386-25)
hockey> indexes a large file with dbase III+ with both the dbase and
hockey> index file on the same mounted drive the system comes to
hockey> standstill. When we check the system we find all 8 nfsd's have
hockey> completely taken over the whole system.
It's a well known phenomenon with early releases of SunOS 4. 8 nfsds
thrash the MMU/cache of a sun 3. This cache has exactly 8 slots; if only
the 8 nfsds are active, everything is fine; as soon as another process
starts to run, you have 9 processes scheduled round robin, and the
MMU/cache is managed LRU. Every context switch brings a cache reload,
which takes about 1ms on the machine you mention... and so on.
This was a problem in the early days of SunOS version 3 (and before).
Since SunOS 3.2, the kernel has had a routine called wakeup_one() which
was used to wake up exactly one idle nfsd process instead of them all.
This effectively eliminated the cache thrashing phenomenon: an nfsd
process would only get woken up if it had something to do.
If all the MMU contexts are in use and there are more runnable
processes, the cache thrashing will be negligible compared to other
system overheads -- waiting for some locked kernel resource to be freed
or for a disk I/O to complete, for example.
Jim