[comp.protocols.nfs] SCO Unix problem with large executables over NFS?

eli (Steve Elias) (09/13/90)

can anyone confirm or deny the following as a bug or feature?

when we run large executables off of an NFS mounted drive under SCO
Unix, the process will sometimes die off randomly, occasionally
reporting that it has received a kill signal.  we've seen this
behavior with both gnu emacs and a large document prep system.

since we moved emacs and the doc system onto a local hard disk, we
haven't seen this sort of process death.

our ethernet is known to drop packets occasionally, leading me to:

theory #1.  ahem.  ahem.  theory #1, which is mine (ours):

when pagedaemon (or appropriate kernel portion) tries to page in some
requested text pages across NFS, and the network drops packet(s),
pagedaemon sends a kill signal to the process which needs the text
page.  restated: something causes the process to die ungracefully if
it can't get its requested page fast enough across NFS.  perhaps there
is some sort of "retry" parameter which can be adjusted.  i've never
seen this behavior on either HPs or Suns running executables across
NFS, so i doubt this is supposed to be happening.
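
to make theory #1 concrete, here's a toy sketch of what i imagine
might be going on.  none of this is real SCO code;
page_in_over_nfs() and MAX_RETRIES are names i made up:

#include <signal.h>
#include <unistd.h>

#define MAX_RETRIES 3   /* hypothetical tunable, not a real SCO parameter */

/* stand-in for the kernel asking the server for a text page;
   returns 0 on success, -1 if the reply never arrived */
static int page_in_over_nfs(void)
{
    return -1;          /* pretend the network dropped our packets */
}

int main(void)
{
    int try;

    for (try = 0; try < MAX_RETRIES; try++)
        if (page_in_over_nfs() == 0)
            return 0;   /* page arrived, process runs on */

    /* theory #1: after too many failures the kernel gives up
       and kills the faulting process outright */
    kill(getpid(), SIGKILL);
    return 1;           /* never reached */
}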

can any of y'all confirm or deny this behavior and/or theory?

btw, the "sco-list@uunet.uu.net" mailing list has proved very
valuable for obtaining timely fixes and feedback on SCO Unix.  three cheers!

[note that this question is posted separately to comp.protocols.nfs
 and comp.unix.sysv386.]
 
/eli

/*  eli@pws.bull.com   617 932 5598   fax 508 294 0101  */

jim@cs.strath.ac.uk (Jim Reid) (09/18/90)

In article <15814@know.pws.bull.com> eli (Steve Elias) writes:

   can anyone confirm or deny the following as a bug or feature?

   when we run large executables off of an NFS mounted drive under SCO
   Unix, the process will sometimes die off randomly, occasionally
   reporting that it has received a kill signal.  we've seen this
   behavior with both gnu emacs and a large document prep system.

   theory #1.  ahem.  ahem.  theory #1, which is mine (ours):

   when pagedaemon (or appropriate kernel portion) tries to page in some
   requested text pages across NFS, and the network drops packet(s),
   pagedaemon sends a kill signal to the process which needs the text
   page.  restated: something causes the process to die ungracefully if
   it can't get its requested page fast enough across NFS.  perhaps there
   is some sort of "retry" parameter which can be adjusted.  i've never
   seen this behavior on either HPs or Suns running executables across
   NFS, so i doubt this is supposed to be happening.

   can any of y'all confirm or deny this behavior and/or theory?

You're on the right lines, but not quite correct.

If an NFS client pages across the network, the faulting process is
suspended until the NFS operation completes. It will not continue
executing until the data has been read from (or written to) the
server and an NFS reply returned. (There's no question of not getting
the page "fast enough": the process has to wait until the page
arrives. What can be a problem is the client and server dropping too
many packets because of a mismatch in throughput between their
ethernet interfaces and protocol-handling code.)
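
You can watch the "wait until the page arrives" behaviour from user
code: map a file on an NFS mount and touch a page; the process simply
sleeps in the fault until the server's reply makes it back. This is
only a sketch, and the path is made up:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* any file on a (hard mounted) NFS filesystem will do;
       this path is made up */
    int fd = open("/nfs/server/bigfile", O_RDONLY);
    char *p;

    if (fd < 0) {
        perror("open");
        return 1;
    }

    p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* this dereference faults the page in across the network; if
       packets are dropped we simply block here while the client
       retransmits -- no signal is delivered to the process */
    printf("first byte: %d\n", p[0]);

    munmap(p, 4096);
    close(fd);
    return 0;
}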

In the case of paging in, the server's reply arrives as a large UDP
datagram, which IP reassembles from fragments before handing it up to
NFS and then back to the suspended user process. If your network
drops packets, the client won't be able to reassemble the datagram,
so the data gets sent again. [To be more precise, the client
retransmits the same request and the server resends the whole reply;
lost fragments are never retransmitted individually.] Eventually the
client gets all the data it asked for and the kernel returns the
page(s) to the waiting process.
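
In user-level terms, that retransmission amounts to a loop like the
following sketch. It is not the real kernel code (the RPC layer is
more involved), and the timeout, retry limit and function are
invented for illustration:

#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>

/* One (pretend) NFS/RPC exchange over UDP: send the request, wait
 * for the reply, and on silence double the timeout and retransmit. */
int rpc_call(int sock, struct sockaddr_in *srv,
             const char *req, size_t reqlen,
             char *reply, size_t replylen)
{
    long timeout = 1;               /* seconds; initial value invented */
    int tries;

    for (tries = 0; tries < 5; tries++) {
        struct timeval tv;
        fd_set fds;

        sendto(sock, req, reqlen, 0,
               (struct sockaddr *)srv, sizeof(*srv));

        tv.tv_sec = timeout;
        tv.tv_usec = 0;
        FD_ZERO(&fds);
        FD_SET(sock, &fds);
        if (select(sock + 1, &fds, NULL, NULL, &tv) > 0)
            return (int)recvfrom(sock, reply, replylen, 0, NULL, NULL);

        timeout *= 2;               /* back off and resend */
    }
    return -1;  /* soft mount: give up; hard mount would loop forever */
}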

Problems arise if the filesystem is soft mounted. With a hard mount,
the client retransmits forever until it succeeds. With a soft mount,
NFS can return an error once some number of retries and/or a timeout
limit has been reached. As far as the paging code is concerned, this
"cannot happen": it's akin to getting an error back from a local disk
read or write. In theory, the kernel should then send the process a
signal which causes it to terminate (a swap error has occurred). Some
NFS implementations apparently ignore the error silently and return a
page of null bytes to the user process instead! That may cause an
immediate core dump (illegal instruction or segmentation violation).
If you're unlucky, the process gets the page of null data and doesn't
realise it until some time later.
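
The difference shows up in ordinary user code too. On a soft mount
the failure eventually comes back as a plain I/O error, which a
program at least has a chance to check; a paging fault offers no such
return path. A sketch (the path and the mount options in the comment
are examples only):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* made-up path on a filesystem mounted with something like
       "soft,timeo=10,retrans=5" */
    int fd = open("/nfs/server/data", O_RDONLY);
    char buf[8192];
    ssize_t n;

    if (fd < 0) {
        perror("open");
        return 1;
    }

    n = read(fd, buf, sizeof buf);
    if (n < 0) {
        /* once timeo/retrans are exhausted the soft mount gives up
           and the error surfaces here, typically as EIO */
        fprintf(stderr, "read: %s\n", strerror(errno));
        return 1;
    }

    /* a paging fault has no return value to check: the kernel can
       only signal the process, or (worse) hand it a page of zeros */
    printf("read %ld bytes\n", (long)n);
    close(fd);
    return 0;
}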

In short, the answer is to hard mount your filesystems. It is a good
idea to do this anyway. Soft mounts don't buy you any worthwhile
advantages and can cause a lot of unpredictable trouble.

		Jim