[comp.protocols.nfs] NFS Performance

golds@fjc.GOV (Rich Goldschmidt) (05/28/91)

I posted a request here for info about two weeks ago with no response, so
I'll try again.  The original query was about how file server performance
was measured.  What do claims of "X" users supported really mean, and how are
those measurements made?  And how do NFS and Novell compare on the same 
hardware, like a 386 or 486 running either Novell 3.11 or Interactive with
NFS.  There must be people out there who know answers to at least some of
these questions!

And just to broaden the scope, what are people's experiences out there with
Interactive's NFS?  I have observed that the documentation has errors and is
incomplete, and the authentication doesn't work the way it ought to.  Of
greater concern, I am seeing slower NFS file service performance than I had
expected, and getting what I think are a lot of NFS errors.



-- 
Rich Goldschmidt: uunet!fjcp60!golds or golds@fjc.gov
Commercialization of space is the best way to escape the zero-sum economy.
Disclaimer: I don't speak for the government, and it doesn't speak for me...

kdenning@genesis.Naitc.Com (Karl Denninger) (05/28/91)

In article <427@fjcp60.GOV> golds@fjc.GOV (Rich Goldschmidt) writes:
>hardware, like a 386 or 486 running either Novell 3.11 or Interactive with
>NFS.  There must be people out there who know answers to at least some of
>these questions!

Don't even bother comparing these.  They aren't comparable.

Compare Novell on a 386 to a REAL machine, like a MIPS Magnum.  These are 
about the same price point once you buy the Novell software and the 386
machine.  With a REAL server, NFS to a PC running B&W's package is faster
than Novell, both under single-user and multiuser loads.

>And just to broaden the scope, what are peoples experiences out there with
>Interactive's NFS.  I have observed that the documentation has errors and is
>incomplete, and the authentication doesn't work the way it ought to.  What is
>of greater concern is I am seeing slower NFS file service performance than I 
>had expected, and getting what I think are a lot of NFS errors.

ISC's NFS is horrid.  The best I've seen on writes is about 30k/sec on a
fast '386 machine.  The CPU is not saturated, nor is the Ethernet card.  The
problem appears to be in either the Streams modules or the TCP and NFS code
itself.

ISC's NFS is also an older revision, and doesn't support root-remapping on a
mount-point basis.

Blech.

--
Karl Denninger - AC Nielsen, Bannockburn IL (708) 317-3285
kdenning@nis.naitc.com

"The most dangerous command on any computer is the carriage return."
Disclaimer:  The opinions here are solely mine and may or may not reflect
  	     those of the company.

rbraun@spdcc.COM (Rich Braun) (05/29/91)

kdenning@genesis.Naitc.Com (Karl Denninger) writes:
>ISC's NFS is horrid.  The best I've seen on writes is about 30k/sec on a
>fast '386 machine.  The CPU is not saturated, nor is the Ethernet card.

SOSS is almost that fast, running on an 8-bit card with a 20-MHz 386
and talking to a Novell server through a second network hop, using 1K
RPC writes.  ISC can't be *that* bad, can it?

-rich

alex@mks.com (Alex White) (05/30/91)

In article <7678@spdcc.SPDCC.COM> rbraun@spdcc.COM (Rich Braun) writes:
>kdenning@genesis.Naitc.Com (Karl Denninger) writes:
>>ISC's NFS is horrid.  The best I've seen on writes is about 30k/sec on a
>>fast '386 machine.  The CPU is not saturated, nor is the Ethernet card.
>
>SOSS is almost that fast, running on an 8-bit card with a 20-MHz 386
>and talking to a Novell server through a second network hop, using 1K
>RPC writes.  ISC can't be *that* bad, can it?
>
>-rich


Can't be *that* bad?  Sure can!
However, I've found that the CPU is indeed saturated.
That's on a 33 MHz 386 with a 16-bit Ethernet card.
Mind you, I was also trying to figure out why backing up over
the network (rsh ISCmachine 'find | cpio' >/dev/rmt0) was so slow, and after
breaking it all down, found that even the find by itself, with no network
traffic, used a good healthy hunk of the CPU.  Personally, I think ISC
just has a horridly slow nami().

Does anybody know if any of the fancy dandy Ethernet cards with TCP/IP
on them have any kind of driver that would work?  And would they end up
faster?  [If the problem is in streams or something, maybe; but if it's
in nami(), not a hope!]

kdenning@genesis.Naitc.Com (Karl Denninger) (05/30/91)

In article <7678@spdcc.SPDCC.COM> rbraun@spdcc.COM (Rich Braun) writes:
>kdenning@genesis.Naitc.Com (Karl Denninger) writes:
>>ISC's NFS is horrid.  The best I've seen on writes is about 30k/sec on a
>>fast '386 machine.  The CPU is not saturated, nor is the Ethernet card.
>
>SOSS is almost that fast, running on an 8-bit card with a 20-MHz 386
>and talking to a Novell server through a second network hop, using 1K
>RPC writes.  ISC can't be *that* bad, can it?

It will also hang the TCP stack entirely if you mount an ISC disk from a Sun
(or other real workstation) with 8k read/write block sizes and blast a few
hundred K of data back and forth.

If you run 1k blocks, it works.  VERY slowly.

PCs can't load it heavily enough to crash it, but they run real slow.
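
If you want to try the 1k workaround, the read/write size is set from the
client side at mount time.  The exact option syntax varies from vendor to
vendor, but on a typical BSD-derived client it looks something like this
(the server name and paths here are just placeholders):

	# force 1k NFS read/write sizes for this mount
	mount -o rsize=1024,wsize=1024 iscbox:/usr/spool /mnt

It keeps the ISC stack from wedging, but throughput is miserable.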

Yes, ISC's NFS is really that bad (This is 2.2 with or without their
"update" installed).

I'd love a REAL NFS implementation on top of a Real Unix for the 386.  It
doesn't exist to the best of my knowledge.

--
Karl Denninger - AC Nielsen, Bannockburn IL (708) 317-3285
kdenning@nis.naitc.com

"The most dangerous command on any computer is the carriage return."
Disclaimer:  The opinions here are solely mine and may or may not reflect
  	     those of the company.

geoff@hinode.East.Sun.COM (Geoff Arnold @ Sun BOS - R.H. coast near the top) (05/30/91)

Quoth kdenning@genesis.Naitc.Com (Karl Denninger) (in <1991May30.045522.6246@Firewall.Nielsen.Com>):
#I'd love a REAL NFS implementation on top of a Real Unix for the 386.  It
#doesn't exist to the best of my knowledge.

Not even SVR4...? Maybe the folks on comp.unix.sysv386 have some
comparative numbers for ISC and their favourite brand of SVR4.

-- Geoff Arnold, PC-NFS architect, Sun Microsystems. (geoff@East.Sun.COM)   --
------------------------------------------------------------------------------
--     Sun Microsystems PC Distributed Systems ...                          --
--            ... soon to be a part of SunTech (stay tuned for details)     --

larryp@sco.COM (Larry Philps) (05/30/91)

In <1991May30.021412.22925@mks.com> alex@mks.com (Alex White) writes:
> 
> Does anybody know if any of the fancy dandy ethernet cards with tcp/ip
> on them have any kind of driver that would work?

There are a few out there.  The entire Excelan line (now being sold
by Federal Technologies), Interlan, and probably a couple more.

> And would they end up faster?

Typically this is a slower solution than host-based protocols.  Why?
Because in order to keep the price down, the on-board processor, the
one that runs your TCP/IP stack, is slow.  Typically a 6-12 MHz 80186.
This is a lot slower than a 25+ MHz 80486/R3000/SPARC/RS6000/PA-RISC/...
when it comes to fiddling bits and computing a checksum!

In general, there is no reason why this architecture cannot give you
equal or better performance than a host-based scheme, but you have to
put a *real* CPU on the network board.  If you find one, expect to pay
big bucks.
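
For a feel of the bit-fiddling involved, the per-packet work is dominated by
something like the standard Internet checksum loop (RFC 1071).  A minimal C
version, not taken from any particular board's firmware:

	/* RFC 1071 style Internet checksum -- illustrative only */
	unsigned short
	in_cksum(unsigned short *addr, int len)
	{
		unsigned long sum = 0;

		while (len > 1) {			/* sum 16-bit words */
			sum += *addr++;
			len -= 2;
		}
		if (len == 1)				/* trailing odd byte */
			sum += *(unsigned char *)addr;
		sum = (sum >> 16) + (sum & 0xffff);	/* fold the carries */
		sum += (sum >> 16);
		return (unsigned short)~sum;
	}

A 6 MHz 80186 grinds through that loop for every packet a lot more slowly
than a 25+ MHz host CPU does, and that's where the time goes.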

---
Larry Philps,	 SCO Canada, Inc.
Postman:  130 Bloor St. West, 10th floor, Toronto, Ontario.  M5S 1N5
InterNet: larryp@sco.COM  or larryp%scocan@uunet.uu.net
UUCP:	  {uunet,utcsri,sco}!scocan!larryp
Phone:	  (416) 922-1937

mikef@leland.Stanford.EDU (Michael Fallavollita) (05/30/91)

I believe Intel's Unix Rel. 4 has a full implementation of NFS.

                o
               o 
                \\
         -----  ||     Mike Fallavollita
       /_______\||
      |( O   O )|      fallavol@corvus.arc.nasa.gov [128.102.24.98]
      |  --^--  |         mikef@leland.stanford.edu [36.21.0.69]
       \   -   /
         -----

ian@unipalm.uucp (Ian Phillipps) (05/30/91)

rbraun@spdcc.COM (Rich Braun) writes:

>kdenning@genesis.Naitc.Com (Karl Denninger) writes:
>>ISC's NFS is horrid.  The best I've seen on writes is about 30k/sec on a
>>fast '386 machine.  The CPU is not saturated, nor is the Ethernet card.

>SOSS is almost that fast, running on an 8-bit card with a 20-MHz 386
>and talking to a Novell server through a second network hop, using 1K
>RPC writes.  ISC can't be *that* bad, can it?

No: I just copied /unix from an ISC 386 to a Sun 3/50 (yeah!) in 6 seconds
elapsed; that's approx 120k/second.

Ian

brian@telebit.com (Brian Lloyd) (05/31/91)

There exists an nfsstone benchmark program.  I am not sure where it
came from but I will look into it.  I suspect that it came from
comp.sources.* and can probably be found on uunet.
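
In the meantime, a crude timed sequential-write loop over the mounted
filesystem gives a usable first number.  This is just a sketch (the path and
sizes are arbitrary, and it is in no way nfsstone itself):

	/* crude NFS write-throughput check -- not nfsstone */
	#include <stdio.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <time.h>

	#define BLK	8192	/* try 1024 to mimic a small wsize */
	#define NBLK	256	/* 256 * 8k = 2 MB total */

	int
	main(void)
	{
		char buf[BLK];
		int fd, i;
		time_t t0, t1;

		memset(buf, 0, sizeof buf);	/* contents don't matter */
		fd = open("/mnt/nfs/testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		time(&t0);
		for (i = 0; i < NBLK; i++)
			if (write(fd, buf, BLK) != BLK) {
				perror("write");
				return 1;
			}
		if (close(fd) < 0)		/* errors can be deferred to close() */
			perror("close");
		time(&t1);
		printf("%ld bytes in %ld seconds\n",
		    (long)BLK * NBLK, (long)(t1 - t0));
		return 0;
	}

Divide the two numbers and compare against the 25-30k/sec figures quoted
earlier in this thread.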


-- 
Brian Lloyd, WB6RQN                              Telebit Corporation
Network Systems Architect                        1315 Chesapeake Terrace 
brian@napa.telebit.com                           Sunnyvale, CA 94089-1100
voice (408) 745-3103                             FAX (408) 734-3333

kdenning@genesis.Naitc.Com (Karl Denninger) (05/31/91)

In article <1991May30.165457.26093@unipalm.uucp> ian@unipalm.uucp (Ian Phillipps) writes:
>rbraun@spdcc.COM (Rich Braun) writes:
>
>>kdenning@genesis.Naitc.Com (Karl Denninger) writes:
>>>ISC's NFS is horrid.  The best I've seen on writes is about 30k/sec on a
>>>fast '386 machine.  The CPU is not saturated, nor is the Ethernet card.
>
>>SOSS is almost that fast, running on an 8-bit card with a 20-MHz 386
>>and talking to a Novell server through a second network hop, using 1K
>>RPC writes.  ISC can't be *that* bad, can it?
>
>No: I just copied /unix from an ISC 386 to a Sun 3/50 (yeah!) in 6 seconds
>elapsed; thats approx 120k/second.

Note the direction here.

You were READING from the ISC machine.

Now try to copy it the other direction.  You'll either (1) hang TCP/IP,
if you have the default buffer size, or (2) get horrible throughput if
you're using the 1k block size.

The Sun will swamp the ISC machine's protocol stack, and blow it sky high
with 8k (actually, anything more than 1k) block size.

--
Karl Denninger - AC Nielsen, Bannockburn IL (708) 317-3285
kdenning@nis.naitc.com

"The most dangerous command on any computer is the carriage return."
Disclaimer:  The opinions here are solely mine and may or may not reflect
  	     those of the company.

ian@unipalm.uucp (Ian Phillipps) (06/01/91)

kdenning@genesis.Naitc.Com (Karl Denninger) writes:

>In article <1991May30.165457.26093@unipalm.uucp> ian@unipalm.uucp (Ian Phillipps) writes:
>>rbraun@spdcc.COM (Rich Braun) writes:
>>
>>>kdenning@genesis.Naitc.Com (Karl Denninger) writes:
>>>>ISC's NFS is horrid.  The best I've seen on writes is about 30k/sec on a
>>>>fast '386 machine.  The CPU is not saturated, nor is the Ethernet card.
>>
>>
>>No: I just copied /unix from an ISC 386 to a Sun 3/50 (yeah!) in 6 seconds
>>elapsed; thats approx 120k/second.

>Note the direction here.

>You were READING from the ISC machine.

>Now try to copy it the other direction.  You'll either (1) hang TCP/IP,
>if you have the default buffer size, or (2) get horrible throughput if
>you're using the 1k block size.

Sorry - not reading carefully enough.

I just tried it - 25k/second, using the NFS server on the ISC.
Not too sure of the block size settings, but it tallies with the performance
given above.

>The Sun will swamp the ISC machine's protocol stack, and blow it sky high
>with 8k (actually, anything more than 1k) block size.

We have a very slow Sun...

niklas@appli.se (Niklas Hallqvist) (06/02/91)

kdenning@genesis.Naitc.Com (Karl Denninger) writes:

>ISC's NFS is horrid.  The best I've seen on writes is about 30k/sec on a
>fast '386 machine.  The CPU is not saturated, nor is the Ethernet card.  The
>problem appears to be in either the Streams modules or the TCP and NFS code
>itself.

>ISC's NFS is also an older revision, and doesn't support root-remapping on a
>mount-point basis.

>Blech.

These figures shocked me and I didn't believe them at first, so I had to check.
Truly enough, I got only 25k/sec on a transfer from a SCO Unix to an ISC 2.0.2.
The other way around it was 200k/sec; that's nearly an order of magnitude
difference!  What startles me is that I think Lachman wrote the NFS
implementation for both SCO and ISC, but that might be wrong.  If it is not,
I think ISC should get in touch with Lachman and talk about an upgrade of
ISC's NFS.  If that does not happen, our company will seriously have to
reconsider the choice of ISC over SCO, as I'm sure others will too, now that
this has come to the net's attention.  This problem would have caused me lots
of performance trouble very soon if I had not seen this message.  Thanks,
Karl.  Yeah, I know I should have done some benchmarking before deciding on
an OS for our soon-to-be backup server, but laziness is one of my great
virtues (at least Larry Wall thinks that's a virtue...).

						Niklas

-- 
Niklas Hallqvist	Phone: +46-(0)31-40 75 00
Applitron Datasystem	Fax:   +46-(0)31-83 39 50
Molndalsvagen 95	Email: niklas@appli.se
S-412 63  GOTEBORG, Sweden     mcsun!sunic!chalmers!appli!niklas

r_hockey@fennel.cc.uwa.oz.au (06/12/91)

Has anyone seen this happen?  We have a PC-NFS network with about 50 PCs
served by a Sun 3/160.  When a user (on a 386-25) indexes a large file with
dbase III+, with both the dbase and index files on the same mounted drive,
the system comes to a standstill.  When we check the system we find all
8 nfsd's have completely taken over the whole system.

Richard Hockey
Public Health
UDM
University of Western Australia
NEDLANDS WA 6009

lm@slovax.Eng.Sun.COM (Larry McVoy) (06/13/91)

niklas@appli.se (Niklas Hallqvist) writes:
> Truely enough, I got only 25k/sec on transfer from a SCO Unix to an ISC 2.0.2.
> The other way around it was 200k/sec, that's an order of magnitude involved here!

I suspect you are running into the following NFSism: all writes to an NFS
server are turned into sync writes on the server (like you opened with
O_SYNC).  This is very slow, large transfers can be 3x or more slower on
writes than reads.

The reason for this has to do with NFS' stateless nature - it can't ACK
the write until the data is safe; otherwise the server could crash and the
client would lose data.
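
In local-file terms, the server treats every client write roughly as if the
file had been opened with O_SYNC.  A fragment for illustration (the path is
a placeholder, and this is not the actual server code):

	#include <fcntl.h>

	/* with O_SYNC, each write() on fd returns only after the data
	   has actually reached the disk, not just the buffer cache */
	fd = open("/export/data/file", O_WRONLY | O_CREAT | O_SYNC, 0644);

That per-write wait for the disk is where the 3x (or worse) hit comes from.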

> What startles me is that I think Lachman has written the NFS implementation for
> both SCO and ISC, but that might be wrong.  

Lachman didn't write either - they ported them.
---
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com

kdenning@genesis.Naitc.Com (Karl Denninger) (06/13/91)

In article <623@appserv.Eng.Sun.COM> lm@slovax.Eng.Sun.COM (Larry McVoy) writes:
>niklas@appli.se (Niklas Hallqvist) writes:
>> Truely enough, I got only 25k/sec on transfer from a SCO Unix to an ISC 2.0.2.
>> The other way around it was 200k/sec, that's an order of magnitude involved here!
>
>I suspect you are running into the following NFSism: all writes to an NFS
>server are turned into sync writes on the server (like you opened with
>O_SYNC).  This is very slow, large transfers can be 3x or more slower on
>writes than reads.
>
>The reason for this has to do with NFS' stateless nature - it can't ACK
>the write until the data is safe; otherwise the server could crash and the
>client would lose data.

The interesting thing is, there is little or no disk activity going on (from
a look at the wait I/O times and queues)..... on a Sun, on the other hand,
there IS a lot of disk activity during an NFS write operation.

The MIPS systems I've used don't suffer from this problem.

I don't quite understand the fanaticism with which people preach the NFS
stateless nature, O_SYNC and all that.  The fact is that a crash of a 
LOCAL Unix machine with the normal block buffering scheme can easily cause 
the loss of data -- in this case, the write(2) call returned "ok" but it 
really might not be "OK"!  This is true whether the problem is later found 
to be a bad disk sector, the machine panicking, or any one of a number of 
other causes.  Normal disk I/O on Unix machines is NOT reliable enough to 
say "if you get a good return from write(), the data is safely on disk".

If you WANT reliable I/O, you open with O_SYNC and take the performance hit.

Why wasn't this option designed into NFS?  It could have been set up so that 
Non-Unix clients (which expect reliable I/O and don't have a "buffer 
cache" that can be disabled on a file I/O basis) default to O_SYNC mode... 
this is easily handled by making the Unix "open()" hook set the "no sync" 
flag...

Or was this a short-cut that has just never been repaired?

--
Karl Denninger - AC Nielsen, Bannockburn IL (708) 317-3285
kdenning@nis.naitc.com

"The most dangerous command on any computer is the carriage return."
Disclaimer:  The opinions here are solely mine and may or may not reflect
  	     those of the company.


lm@slovax.Eng.Sun.COM (Larry McVoy) (06/14/91)

kdenning@genesis.Naitc.Com (Karl Denninger) writes:
> >The reason for this has to do with NFS' stateless nature - it can't ACK
> >the write until the data is safe; otherwise the server could crash and the
> >client would lose data.
> 
> The interesting thing is, there is little or no disk activity going on (from
> a look at the wait I/O times and queues)..... on a Sun, on the other hand,
> there IS a lot of disk activity during an NFS write operation.
> 
> The MIPS systems I've used don't suffer from this problem.
> 
> I don't quite understand the fanatacism with which people preach the NFS
> stateless nature, O_SYNC and all that.  The fact is that a crash of a 
> LOCAL Unix machine with the normal block buffering scheme can easily cause 
> the loss of data -- in this case, the write(2) call returned "ok" but it 
> really might not be "OK"!  This is true whether the problem is later found 
> to be a bad disk sector, the machine panicing, or any one of a number of 
> other causes.  Normal disk I/O on Unix machines is NOT reliable enough to 
> say "if you get a good return from write(), the data is safely on disk".

NFS is stateless.  The reason for this statelessness is so that a client
does not need to do anything special when a server goes down.  A dead
server looks just like a slow server to a client.

A client issues a write, the server ACKs the write.  What does that ACK
mean?  It means that the client data is safe.  The client kernel may
throw away the data, the server has promised that the data can be 
retrieved.

If the server ACKs the data before writing it to disk, there is a window
during which the server can crash.  The data is then lost.  

MIPS systems have an unsafe export option that allows you to turn off
this constraint - big performance win, big safety lose.

There are other ways to address this problem without breaking the 
semantics of NFS.  One such way is to buffer the writes in NVRAM.
---
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com

kdenning@genesis.Naitc.Com (Karl Denninger) (06/14/91)

In article <625@appserv.Eng.Sun.COM> lm@slovax.Eng.Sun.COM (Larry McVoy) writes:
>kdenning@genesis.Naitc.Com (Karl Denninger) writes:
>> 
>> I don't quite understand the fanatacism with which people preach the NFS
>> stateless nature, O_SYNC and all that.  The fact is that a crash of a 
>> LOCAL Unix machine with the normal block buffering scheme can easily cause 
>> the loss of data -- in this case, the write(2) call returned "ok" but it 
>> really might not be "OK"!  This is true whether the problem is later found 
>> to be a bad disk sector, the machine panicing, or any one of a number of 
>> other causes.  Normal disk I/O on Unix machines is NOT reliable enough to 
>> say "if you get a good return from write(), the data is safely on disk".
>
>NFS is stateless.  The reason for this statelessness is so that a client
>does not need to do anything special when a server goes down.  A dead
>server looks just like a slow server to a client.

So far, so good.

>A client issues a write, the server ACKs the write.  What does that ACK
>mean?  It means that the client data is safe.  The client kernel may
>throw away the data, the server has promised that the data can be 
>retrieved.
>
>If the server ACKs the data before writing it to disk, there is a window
>during which the server can crash.  The data is then lost.  

How does this differ from the standard "Unix" way of doing file I/O, which
returns a successful reply from a write call before the data is safely on
disk?  If you write data, get back a "ACK" (or good return value) the data
isn't necessarily on disk -- it could be in the buffer cache.  If the 
machine crashes before the data is flushed you lose.

I can't see how this is any different than ACKing packets from NFS clients
when you haven't actually written them any further than the buffer cache
(exactly the same as the standard Unix semantics).  You have the same risks
if the server (the machine with the disk on it :-) crashes as you would with
a local workstation or server drive.  In both cases data can be lost.

>MIPS systems have an unsafe export option that allows you to turn off
>this constraint - big performance win, big safety lose.

There is no export option in the manual pages for RiscOS 4.51 which 
addresses what you're talking about.  I just checked again; it's not there.

>There are other ways to address this problem without breaking the 
>semantics of NFS.  One such way is to buffer the writes in NVRAM.

Like Legato's PrestoServe.  Yes, I know.

That is not completely safe either.  You could have "something happen" to 
the presto board -- and your data would be lost.

The point is that standard Unix machines often say "your data is safe" when
it really isn't.  In fact, ALL systems, by virtue of the fact that hardware
can fail, make this assumption.  I don't see what you buy by having 
the default for NFS transactions be more "safe" than a local disk drive -- 
other than making recovery from crashes simple for the client side.

I would think that one of the easiest ways to address this would be to allow
an option to have "safe" or "unsafe" writes on a per-mount basis.  This
allows the user to choose his level of performance and risk, and make
his/her own choice.  I'd be for that.

--
Karl Denninger - AC Nielsen, Bannockburn IL (708) 317-3285
kdenning@nis.naitc.com

"The most dangerous command on any computer is the carriage return."
Disclaimer:  The opinions here are solely mine and may or may not reflect
  	     those of the company.

k2@bl.physik.tu-muenchen.de (Klaus Steinberger) (06/14/91)

kdenning@genesis.Naitc.Com (Karl Denninger) writes:

>>MIPS systems have an unsafe export option that allows you to turn off
>>this constraint - big performance win, big safety lose.

>There is no export option in the manual pages for RiscOS 4.51 which 
>addresses what you're talking about.  I just checked again; it's not there.

It's not in the exports file, it's in fstab.  Here is the relevant part of
the manpage (FSTAB(SysV)):

          nfs_sync|nfs_async
                    Typically, NFS performs synchronous writes;
                    however, performing asynchronous writes can
                    speed-up performance tremendously. This
                    option allows nfs servers to control their
                    local file system behavior for NFS write
                    requests. Synchronous writes guarantee the
                    data has been written to disk rather than
                    guaranteeing that a server has correctly
                    received a write request. This only becomes
                    an issue if a server crashes during a write.
                    See kopt (8) to change the global default
                    value. nfs_async is the current default.

As you can see, the default for the server is ASYNC.  You can change the
behaviour on a per-filesystem basis at mount time, or globally at system
startup with kopt.  We use the default of async, and we are happy.
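
For example, a line along these lines in the server's fstab forces
synchronous NFS writes for one exported filesystem (the device name and the
other fields here are placeholders -- check your own fstab for the exact
layout on your release):

	# force synchronous NFS writes for the exported /home filesystem
	/dev/dsk/ips0d0s2  /home  ffs  rw,nfs_sync  0  2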

Sincerely,
Klaus Steinberger

--
Klaus Steinberger               Beschleunigerlabor der TU und LMU Muenchen
Phone: (+49 89)3209 4287        Hochschulgelaende
FAX:   (+49 89)3209 4280        D-8046 Garching, Germany
BITNET: K2@DGABLG5P             Internet: k2@bl.physik.tu-muenchen.de

thurlow@convex.com (Robert Thurlow) (06/14/91)

In <1991Jun13.164017.29944@Firewall.Nielsen.Com> kdenning@genesis.Naitc.Com (Karl Denninger) writes:

>Why wasn't [O_SYNC] designed into NFS?

Um, we picked this up from the NFSSRC 4.0 release.  Pages get written
to the buffer cache so they can be read next time, but your write(2)
blocks until all I/O is completed on those pages.  If Sun has it, and
Sun licensees have it, in what sense is more needed?  If your vendor
doesn't do this, and does have a buffer cache and biods, just ask them
for the option.

Rob T
--
Rob Thurlow, thurlow@convex.com
An employee and not a spokesman for Convex Computer Corp., Dallas, TX

droms@regulus.bucknell.edu (Ralph E. Droms) (06/14/91)

In article <1991Jun13.234448.16172@Firewall.Nielsen.Com> kdenning@genesis.Naitc.Com (Karl Denninger) writes:

   >
   >If the server ACKs the data before writing it to disk, there is a window
   >during which the server can crash.  The data is then lost.  

   How does this differ from the standard "Unix" way of doing file I/O, which
   returns a successful reply from a write call before the data is safely on
   disk?  If you write data, get back a "ACK" (or good return value) the data
   isn't necessarially on disk -- it could be in the buffer cache.  If the 
   machine crashes before the data is flushed you lose.

I think the difference lies in the feedback to the user.  If the local
UNIX box crashes, the user is aware "something is wrong" immediately.
If the server crashes and reboots, the data can be lost silently...

--
- Ralph Droms                 Computer Science Department
  droms@bucknell.edu          323 Dana Engineering
                              Bucknell University
  (717) 524-1145              Lewisburg, PA 17837

geoff@hinode.East.Sun.COM (Geoff Arnold @ Sun BOS - R.H. coast near the top) (06/14/91)

Quoth droms@bucknell.edu (in <DROMS.91Jun14092449@regulus.bucknell.edu>):
#In article <1991Jun13.234448.16172@Firewall.Nielsen.Com> kdenning@genesis.Naitc.Com (Karl Denninger) writes:
#
#   >
#   >If the server ACKs the data before writing it to disk, there is a window
#   >during which the server can crash.  The data is then lost.  
#
#   How does this differ from the standard "Unix" way of doing file I/O, which
#   returns a successful reply from a write call before the data is safely on
#   disk?  If you write data, get back a "ACK" (or good return value) the data
#   isn't necessarially on disk -- it could be in the buffer cache.  If the 
#   machine crashes before the data is flushed you lose.
#
#I think the difference lies in the feedback to the user.  If the local
#UNIX box crashes, the user is aware "something is wrong" immediately.
#If the server crashes and reboots, the data can be lost silently...

It's more than simply a vague "feedback to the user": it's a
question of what assertions can be made about the correctness
of file system operations. Even though normal buffer cache
operations can reorder some kinds of operation, I can code something
like

	write(file1, data1)
	fsync(file1)
	write(file2, "file1 was written successfully")

(with appropriate error checking) and be confident that file2 will
be written if and only if file1 was written. Karl's "standard Unix way"
doesn't apply here: if the machine crashes, the process will crash
with it. If an NFS server could ack the first write (but not
commit it to stable storage), then crash and reboot, the failure
of the write would be undetectable.
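
Fleshed out, with the error checking included, it might look like this in C
(the file names are placeholders, and descriptors leaked on the error paths
are ignored for brevity; the point is that the second write happens only if
fsync() on the first has succeeded):

	#include <fcntl.h>
	#include <unistd.h>

	int
	log_then_mark(const char *data, int len)
	{
		int f1, f2;

		f1 = open("file1", O_WRONLY | O_CREAT | O_TRUNC, 0644);
		if (f1 < 0 || write(f1, data, len) != len ||
		    fsync(f1) < 0 || close(f1) < 0)	/* file1 must be on stable storage */
			return -1;

		f2 = open("file2", O_WRONLY | O_CREAT | O_TRUNC, 0644);
		if (f2 < 0 || write(f2, "file1 was written successfully\n", 31) != 31 ||
		    fsync(f2) < 0 || close(f2) < 0)
			return -1;
		return 0;
	}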

The decision as to whether data should be written "safely" or not should
logically rest with the client, not the server. This is why the
hack of an async server side configuration option is so dangerous.
The correct approach, of course, is the (unimplemented) RFS_WRITECACHE
NFS function.... >sigh< But for now, Prestoserve is the best solution.
--Geoff Arnold, PC-NFS architect(geoff@East.Sun.COM or geoff.arnold@Sun.COM)--
------------------------------------------------------------------------------
--       Sun Technology Enterprises : PC Networking group                   --
--   (officially from July 1, but effectively in place right now)           --

thurlow@convex.com (Robert Thurlow) (06/14/91)

In <1991Jun13.234448.16172@Firewall.Nielsen.Com> kdenning@genesis.Naitc.Com (Karl Denninger) writes:

>I can't see how this is any different than ACKing packets from NFS clients
>when you haven't actually written them any further than the buffer cache
>(exactly the same as the standard Unix semantics).  You have the same risks
>if the server (the machine with the disk on it :-) crashes as you would with
>a local workstation or server drive.  In both cases data can be lost.

No you _don't_ have the same risks; you have a lot more points of
failure, like someone turning off your server and physically removing
it from your network, for example.  With NFS, you've taken a pretty
reliable disk I/O subsystem and put the disk maybe very far away, with
lots of failure points you didn't have before, and with other processes
able to alter the data outside of either your control or awareness.  To
some degree, though, you're still expecting it to obey perfect Unix
filesystem semantics.  It just ain't gonna work that way (though if Sun
fixed some of the protocol bugs in NFS, it'd be better).  An
implementor needs to think harder to get NFS to do The Right Thing.

Take this for an example: you're doing a 1K write to a filesystem with
an 8K blocksize, so you need to do a read/edit/write of a whole block.
What happens when the initial read tells you that the file has changed,
and that you should flush everything you know about the file out of
your buffer cache?  How do you hang onto the data you were trying to
write?  That isn't a problem over UFS.

>>MIPS systems have an unsafe export option that allows you to turn off
>>this constraint - big performance win, big safety lose.

>There is no export option in the manual pages for RiscOS 4.51 which 
>addresses what you're talking about.  I just checked again; it's not there.

Yup, we do this too, but we make a discouraging noise about it, and
it isn't the default like it is on Silicon Graphics machines.  It's
worth the kick for some things, though; my Sun (running 4.1.1) often
doesn't survive a server crash, so an 'unsafe' swap might be okay.

>I would think that one of the easiest ways to address this would be to allow
>an option to have "safe" or "unsafe" writes on a per-mount basis.  This
>allows the user to choose his level of performance and risk, and make
>his/her own choice.  I'd be for that.

It would make a good mount option, I agree.  Having a global decision
about such things made for you sucks.

Rob T
--
Rob Thurlow, thurlow@convex.com
An employee and not a spokesman for Convex Computer Corp., Dallas, TX

pcg@aber.ac.uk (Piercarlo Grandi) (06/15/91)

On 12 Jun 91 03:00:04 GMT, r_hockey@fennel.cc.uwa.oz.au said:

hockey> Has anyone seen this happen, We have a PC-NFS network with about
hockey> 50 PCs served by a SUN 3/160 when a user (using a 386-25)
hockey> indexes a large file with dbase III+ with both the dbase and
hockey> index file on the same mounted drive the system comes to
hockey> standstill.  When we check the system we find all 8 nfsd's have
hockey> completely taken over the whole system.

It's a well-known phenomenon with early releases of SunOS 4: 8 nfsds
thrash the MMU/cache of a Sun 3.  This cache has exactly 8 slots; if only
the 8 nfsds are active, everything is fine; as soon as another process
starts to run, you have 9 processes scheduled round robin, and the
MMU/cache is managed LRU. Every context switch brings a cache reload,
which takes about 1ms on the machine you mention... and so on.

Just reduce the number of nfsds to 4... The problem will go away, and
you don't really lose IO performance between 8 and 4 nfsds, even if
there are people who argue to the contrary. Try it.
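
On a stock SunOS 4.x server that is a one-line edit where the daemons are
started at boot time; the fragment in /etc/rc.local looks roughly like this
(the exact wording differs between releases, so treat it as a sketch rather
than a literal diff):

	if [ -f /usr/etc/nfsd ]; then
		nfsd 4 & echo -n ' nfsd'	# was: nfsd 8 &
	fi

Kill the running nfsds and restart them by hand, or just reboot.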

Or you could buy a Sun 4 which has many more MMU/cache contexts and runs
a newer release of the OS in which the problem has been worked around a bit.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@aber.ac.uk

kdenning@genesis.Naitc.Com (Karl Denninger) (06/15/91)

In article <6743@eastapps.East.Sun.COM> geoff@east.sun.com (Geoff Arnold @ Sun BOS - R.H. coast near the top) writes:
>Quoth droms@bucknell.edu (in <DROMS.91Jun14092449@regulus.bucknell.edu>):
>#In article <1991Jun13.234448.16172@Firewall.Nielsen.Com> kdenning@genesis.Naitc.Com (Karl Denninger) writes:
>#
>#   >
>#   >If the server ACKs the data before writing it to disk, there is a window
>#   >during which the server can crash.  The data is then lost.  
>#
>#   How does this differ from the standard "Unix" way of doing file I/O, which
>#   returns a successful reply from a write call before the data is safely on
>#   disk?  .....
>#
>#I think the difference lies in the feedback to the user.  If the local
>#UNIX box crashes, the user is aware "something is wrong" immediately.
>#If the server crashes and reboots, the data can be lost silently...
>
>It's more than simply a vague "feedback to the user": it's a
>question of what assertions can be made about the correctness
>of file system operations. Even though normal buffer cache
>operations can reorder some kinds of operation, I can code something
>like
>
>	write(file1, data1)
>	fsync(file1)
>	write(file2, "file1 was written successfully")
>
>(with appropriate error checking) and be confident that file2 will
>be written if and only if file1 was written. Karl's "standard Unix way"
>doesn't apply here: if the machine crashes, the process will crash
>with it. If an NFS server could ack the first write (but not
>commit it to stable storage), then crash and reboot, the failure
>of the write would be undetectable.

Understood.  However, the issue is data loss, not reboot-n-continue
behavior or whether the process dies along with the machine.  If you 
soft mount directories (yes, I know this is dangerous) your process will 
get an I/O failure if the server goes down -- indicating that you have 
lost >something<.

Data loss is data loss -- with or without the process continuing to exist.

I would think that the real solution here would be to have a crashed and
rebooted server return some form of error on the next I/O request (what, I 
don't know offhand, perhaps ENXIO) if you are mounted async and the server 
crashes and reboots.  At least you'd be notified that there is a potential 
data integrity problem that your software needs to investigate or report.

>The decision as to whether data should be written "safely" or not should
>logically rest with the client, not the server. This is why the
>hack of an async server side configuration option is so dangerous.
>The correct approach, of course, is the (unimplemented) RFS_WRITECACHE
>NFS function.... >sigh< But for now, Prestoserve is the best solution.
>--Geoff Arnold, PC-NFS architect(geoff@East.Sun.COM or geoff.arnold@Sun.COM)--

AGREED.  The decision SHOULD be with the client.  I believe that many
systems would opt for the async choice, but I disagree with making it
something you don't have control over at the client level.

One other option would be to have fsync() on an NFS file return success 
only if all operations since the last fsync() or open() had succeeded.  
A crash is an exception condition here, since the client will not have 
executed an open() prior to the fsync() -- thus, in that case fsync() 
would return failure.  If the client opens with O_SYNC, then you do only
sync I/O.  On a close() do an implied fsync(), and again return success 
only if all data "makes it".

This does require keeping one bit of state around -- whether or not an
"open" or "fsync" has been executed (a noted I/O error rates a "no" to that
question).

This is very close to the semantics of a local filesystem, and should be
pretty easy to do.  It also doesn't affect anything in existing software 
(except that programs which don't do an fsync() or check close() return 
values are at risk -- but on a local disk they would be too!)  This is what 
one would expect on a local disk in the event of a disk failure -- if you 
didn't check close()'s return value you might mistakenly think your data 
all got there when it didn't.
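
In other words, a careful client program already has to follow this
discipline today.  Something like the following sketch (made-up helper name,
but only standard calls), where a deferred write error shows up at fsync()
or close() rather than at write():

	#include <fcntl.h>
	#include <unistd.h>

	/* returns 0 only if the data is known to have made it */
	int
	careful_write(const char *path, const char *buf, int len)
	{
		int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

		if (fd < 0)
			return -1;
		if (write(fd, buf, len) != len || fsync(fd) < 0) {
			close(fd);
			return -1;
		}
		return close(fd);	/* close() can still report a deferred error */
	}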

Prestoserve is not a total safety net -- it's hardware, and CAN fail.  The
risks there are exactly the same as a crash/disk failure/whatever.  The 
only real saving grace there is that it doesn't fail often, having no 
moving parts.

--
Karl Denninger - AC Nielsen, Bannockburn IL (708) 317-3285
kdenning@nis.naitc.com

"The most dangerous command on any computer is the carriage return."
Disclaimer:  The opinions here are solely mine and may or may not reflect
  	     those of the company.

beepy@terra.Eng.Sun.COM (Brian Pawlowski) (06/15/91)

[I lost the original article I was responding to so am
replying to a followup]

In article <1991Jun13.234448.16172@Firewall.Nielsen.Com>, kdenning@genesis.Naitc.Com (Karl Denninger) writes:

> The MIPS systems I've used don't suffer from this problem.

Then they probably suffer from other problems:-) That they don't
have an export option indicating "safe" and "unsafe" raises a
question...
                                                
> I don't quite understand the fanatacism with which people preach the NFS
> stateless nature, O_SYNC and all that.  The fact is that a crash of a
> LOCAL Unix machine with the normal block buffering scheme can easily cause
> the loss of data -- in this case, the write(2) call returned "ok" but it
> really might not be "OK"!  This is true whether the problem is later found
> to be a bad disk sector, the machine panicing, or any one of a number of
> other causes.  Normal disk I/O on Unix machines is NOT reliable enough to
> say "if you get a good return from write(), the data is safely on disk".
 
Good analysis for local operation. I would argue (below) that
distributed operation is a little different (particularly in regards to
assumptions and expected behaviour during failures of nodes involved as
compared to assumptions made when a local component fails).
              
> If you WANT reliable I/O, you open with O_SYNC and take the performance hit.
>
> Why wasn't this option designed into NFS?  It could have been set up so that
> for Non-Unix clients (which expect reliable I/O and don't have a "buffer
> cache" that can be disabled on a file I/O basis) default to O_SYNC mode...
> this is easily handled by making the Unix "open()" hook set the "no sync"
> flag...
>
> Or was this a short-cut that has just never been repaired?
 
The "problem" with a distributed file system, as typified by NFS, is that
modifying operations on the server held in buffer cache could          
be lost due to a crash/failure without the client *ever* being aware
of it, if the updates are made asynchronously (some time in the future)
to the persistent store (disk) following successful acknowledgement to
the client. NFS simplifies the (likely) failure modes possible by
adding the semantic to a modifying operation (write, create, etc) that 
the operation has been applied to persistent store.

[I believe several studies on the reliability of hosts on the Internet
point to non-catastrophic failures--SW failures:-)--as a primary
cause of crashes. This semantic for NFS addresses this nicely. While
having access to some number N of hosts increases availability
of data to a given node, it offers so many more opportunities
to experience an unexpected component failure (server crash) to
give you a chance to lose data. NFS reduces its "critical state" assumptions,
simplifies client-server operation, and I believe increases reliability
through the requirement for flush-to-persistent storage.]

I think the analogy to the "local I/O" situation is flawed.
A user gets immediate notification of an "OS crash" in the
local case because his application crashes too.  He has *little
expectation* that all his data is safe and will probably take
some action to investigate the situation.

In the distributed case, things get fuzzier. Assuming for a moment
that a given vendor implements buffered writes on an NFS server
to increase performance (tossing the synchronous modify semantic),
you have now introduced an interesting error class: silent data
loss. The scenario introduced is that the server can acknowledge
a final write by an application while holding several buffered
data blocks for a client queued to write to disk. The server
returns "OK", the client application is happy and exits. The server
crashes before it is able to flush data.

Blissfully unaware, the client (and user) continue working on other
files on other servers, and do return to the server in question
sometime later after it has rebooted--and lost data the user believes
was written to disk on the server.

I contend that the user expects the data to be on disk because
*he knows* his machine has been running beyond the synchronization time
of flushing data to disk from a client's perspective. To find
that the data is *lost* some time in the future without having
had an intervening client crash introduces an insidious error
and I (further) contend violates a basic transparency property
provided by NFS (of making remote files seem like local files--to
a great degree).

NFS does not provide "exact" local file system semantics for UNIX.
The original design paper describes the decisions made in providing
semantics and the trade-offs made to simplify implementation and reduce complexity
of error recovery. One could envision a production DFS which buffers
data on a server for increased performance in volatile storage.
I believe most current (research) systems which do so take a
rather cavalier attitude towards ensuring integrity of modified
data on behalf of users. I would propose that you would want
to introduce recovery mechanisms to allow a client to resubmit
lost data due to a server crash--this introduces complex recovery
scenarios to a DFS, and was left out of NFS in the original design.
[Asynchronous writes after a fashion have been proposed for
a protocol revision of NFS... Some time in the hazy future.]

Comments on Write Performance for NFS:

NFS is not so bad as would be inferred from the above discussion
from a client's perspective on writes.

A client OS *still* does read-ahead and write-behind for application
I/O when talking to an NFS server through the use of BIODs.        
The close() system call semantic was extended to include a synchronous
flush of all dirty modified pages when you close a file which
ensures that any errors in flushing modified data to a server
will be made available to the application.  [The addition of the
flush-on-close semantic to support asynchronous error return for
NFS was a design trade-off vs. *totally* synchronous writes
from the application perspective.]

I believe this trade-off gets close to local file expected behaviour
and eliminates silent data loss. [For expected likely failures--
SW crashes. Of course a hard disk crash burns everyone--but I
believe this is *much less* expected.]

This is not to say that write performance for NFS is outstanding:-) 
I am a proponent of improving write performance beyond current NFS
levels.                      
                                      
One immediate attack is to install a Presto board (Sun and DEC
have this. Others?) Hell, it will accelerate your local synchronous
modifying operations (like mkdir, etc). Another attack is
to use a product like eNFS for accelerating large file writes.

All improvements in this area (as the above solutions do) should
recognize that distribution inherently introduces different (more
interesting) failure modes, and that I for one (and I believe others)       
don't appreciate an implementation of a distributed file
system which provides me with the wonderful possibilities
of silent loss of critical data.                         
                                
> Karl Denninger - AC Nielsen, Bannockburn IL (708) 317-3285
> kdenning@nis.naitc.com

Brian Pawlowski
last time I looked

guy@auspex.auspex.com (Guy Harris) (06/21/91)

>It's a well known phenomenon with early releases of SunOS 4. 8 nfsds
>thrash the MMU/cache of a sun 3. This cache has exactly 8 slots; if only
>the 8 nfsds are active, everything is fine; as soon as another process
>starts to run, you have 9 processes scheduled round robin, and the
>MMU/cache is managed LRU. Every context switch brings a cache reload,
>which takes about 1ms on the machine you mention... and so on.

Could you please point to the lines of source code in those early
releases that either 1) cause NFS daemons not to release their user-mode
address space and context or 2) cause context switches to processes
without contexts (kernel processes such as NFS daemons) to reload the
context register?

If such code exists, it must have been in a release prior to SunOS
4.0.3, as:

1) the "nfs_svc()" call in SunOS 4.0.3, which the "nfsd" program makes
   to create an NFS daemon, calls "relvm()" which completely discards
   the user-mode address space of the process (e.g., it nulls out
   "u.u_procp->p_as");

2) the context-switching code in SunOS 4.0.3 ("resume()") won't bother
   reloading the context register if it's switching to a process with no
   user-mode address space (i.e., with a null "u.u_procp->p_as").

Given that, the 8 vs. 4 has nothing whatsoever to do with the number
of contexts in 4.0.3.  Maybe releases prior to that missed 1) or 2), but
I tend to doubt it.

jim@cs.strath.ac.uk (Jim Reid) (06/26/91)

In article <PCG.91Jun14194239@aberdb.aber.ac.uk> pcg@aber.ac.uk (Piercarlo Grandi) writes:


   On 12 Jun 91 03:00:04 GMT, r_hockey@fennel.cc.uwa.oz.au said:

   hockey> Has anyone seen this happen, We have a PC-NFS network with about
   hockey> 50 PCs served by a SUN 3/160 when a user (using a 386-25)
   hockey> indexes a large file with dbase III+ with both the dbase and
   hockey> index file on the same mounted drive the system comes to
   hockey> standstill.  When we check the system we find all 8 nfsd's have
   hockey> completely taken over the whole system.

   It's a well known phenomenon with early releases of SunOS 4. 8 nfsds
   thrash the MMU/cache of a sun 3. This cache has exactly 8 slots; if only
   the 8 nfsds are active, everything is fine; as soon as another process
   starts to run, you have 9 processes scheduled round robin, and the
   MMU/cache is managed LRU. Every context switch brings a cache reload,
   which takes about 1ms on the machine you mention... and so on.

This was a problem in the early days of SunOS version 3 (and before).
Since SunOS 3.2, the kernel has had a routine called wakeup_one() which
is used to wake up exactly one idle nfsd process instead of them all.

This effectively eliminated the cache thrashing phenomenon: an nfsd
process would only get woken up if it had something to do.

If all the MMU contexts are in use and there are more runnable
processes, the cache thrashing will be negligible compared to other
system overheads -- waiting for some locked kernel resource to be freed
or for a disk I/O to complete, for example.

		Jim