cph@zurich.ai.mit.edu (Chris Hanson) (08/15/90)
We had some trouble with the 7.0 cluster software. I'd like to describe the problem and its solution so that other people won't run into the same thing.

Our cluster consists of a model 850 server and a group of Series 300 clients. Some of the clients are model 350's with 32 megabytes of RAM.

The problem we observed: writing files from a client to the server took a very long time. In particular, the following sequence, executed on one of our 350 clients, consistently took 50 seconds (elapsed real time) or more:

* Open a file on the server (we opened one in "/tmp/").
* Write a megabyte of data to the file.
* Close the file.

The actual time varied somewhat, but never dropped below 50 seconds. When the same sequence was executed on the cluster server itself, it took 1.7 seconds; when the client wrote to a file on an NFS server, it took about 5 seconds. Another interesting fact: reading the one-megabyte file back from the cluster server to the client took about 5 seconds, while reading from an NFS server to the client took about 3.5 seconds.

We happened on a solution more or less by accident: build the client's kernel with the parameter NBUF set to 16 (the minimum value). With that change, the time for the above sequence dropped to 6.5 seconds, which is much more consistent with our expectations. Unfortunately, this makes reading files slightly slower, but not so much that it is a problem.

Some questions for the HP folks:

* Why is this happening, and why is it so bad? In particular, why is the default NBUF value so awful for this case? (The default is to configure dynamically, which in this case means NBUF is 473.)
* Is there something screwy with our kernel configuration?
* Is this specific to mixed clusters?

I'd appreciate any answers. Finally, even after this fix the times aren't really so great. Is there something I can do to tune this system so that it performs better?

Here's the kernel configuration information from our dfile.
The only difference between the kernel that worked reasonably and the one that didn't is that the latter had no explicit definition of NBUF.

    dskless_node 1
    dskless_cbufs 9
    dskless_fsbufs 16
    dskless_mbufs 4
    ngcsp 0
    num_cnodes 0
    server_node 0
    serving_array_size 16
    * `using_array_size' must be the same size as `nproc'.
    using_array_size 150
    *
    dmmax 2048
    dmmin 16
    dmshm 2048
    dmtext 2048
    * Disable support for HP98248A (Floating Point Accelerator)
    * by changing the 1 to a 0 in the following line.
    fpa 0
    maxdsiz 0x6400000
    maxssiz 0x0400000
    maxtsiz 0x1C00000
    maxuprc 100
    nbuf 16
    nproc 150
    ntext 150
    num_lan_cards 1
    shmmaxaddr 0x6A00000
    timezone 300

Here's the kernel configuration information from the server's S800 file:

    acctresume 4;
    acctsuspend 2;
    bufpages 0;
    dst 1;
    scroll_lines 100;
    maxdsiz 0x8000;
    maxssiz 0x1000;
    maxtsiz 0x8000;
    maxuprc 100;
    maxusers 32;
    msgmap 100;
    msgmax 8192;
    msgmnb 16384;
    msgmni 50;
    msgseg 1024;
    msgssz 8;
    msgtql 40;
    nbuf 0;
    ncallout "(64 + NPROC)";
    netmeminit 0;
    netmemmax "(1024 * NETCLBYTES)";
    netmemthresh 0;
    nfile "(16 * (NPROC + 16 + MAXUSERS) / 10 + 32 + 2 * NETSLOP)";
    nflocks 200;
    ninode 8192;
    nproc 512;
    npty 60;
    ntext 128;
    semaem 16384;
    semmap 10;
    semmni 10;
    semmns 60;
    semmnu 30;
    semume 10;
    semvmx 32767;
    shmmni 100;
    shmmax 0x4000000;
    shmseg 12;
    timeslice "(HZ/10)";
    timezone 300;
    unlockable_mem 0;
    server_node 1;
    num_cnodes 40;
    ngcsp "(8 * NUM_CNODES)";