cph@zurich.ai.mit.edu (Chris Hanson) (08/15/90)
We had some trouble with the 7.0 cluster software. I'd like to describe the problem and its solution so that other people won't run into the same thing.

Our cluster consists of a model 850 server and a group of Series 300 clients. Some of the clients are model 350's with 32 megabytes of RAM.

The problem we observed: writing files from a client to the server took a very long time. In particular, the following sequence, executed on one of our 350 clients, consistently took 50 seconds (elapsed real time) or more:

* Open a file on the server (we opened one in "/tmp/").
* Write a megabyte of data to the file.
* Close the file.

The actual time varied somewhat, but never dropped below 50 seconds. When the same sequence was executed on the cluster server itself, it took 1.7 seconds; when the client wrote to a file on an NFS server, it took about 5 seconds. Another interesting fact: reading the one-megabyte file back from the cluster server to the client took about 5 seconds, while reading from an NFS server to the client took about 3.5 seconds.

We happened on a solution more or less by accident: build the client's kernel with the parameter NBUF set to 16 (the minimum value). With that change, the time for the above sequence dropped to 6.5 seconds, which is much more consistent with our expectations. Unfortunately, this makes reading files slightly slower, but not so much that it is a problem.

Some questions for the HP folks:

* Why is this happening, and why is it so bad? In particular, why is the default NBUF value so awful for this case? (The default is to configure dynamically, which in this case means NBUF is 473.)
* Is there something screwy with our kernel configuration?
* Is this specific to mixed clusters?

I'd appreciate any answers. Finally, even after this fix the times aren't really so great. Is there something I can do to tune this system so that it performs better?

Here's the kernel configuration information from our dfile.
The only difference between the kernel that worked reasonably and the one that didn't is that the latter had no explicit definition of NBUF.

    dskless_node 1
    dskless_cbufs 9
    dskless_fsbufs 16
    dskless_mbufs 4
    ngcsp 0
    num_cnodes 0
    server_node 0
    serving_array_size 16
    * `using_array_size' must be the same size as `nproc'.
    using_array_size 150
    *
    dmmax 2048
    dmmin 16
    dmshm 2048
    dmtext 2048
    * Disable support for HP98248A (Floating Point Accelerator)
    * by changing the 1 to a 0 in the following line.
    fpa 0
    maxdsiz 0x6400000
    maxssiz 0x0400000
    maxtsiz 0x1C00000
    maxuprc 100
    nbuf 16
    nproc 150
    ntext 150
    num_lan_cards 1
    shmmaxaddr 0x6A00000
    timezone 300

Here's the kernel configuration information from the server's S800 file:

    acctresume 4;
    acctsuspend 2;
    bufpages 0;
    dst 1;
    scroll_lines 100;
    maxdsiz 0x8000;
    maxssiz 0x1000;
    maxtsiz 0x8000;
    maxuprc 100;
    maxusers 32;
    msgmap 100;
    msgmax 8192;
    msgmnb 16384;
    msgmni 50;
    msgseg 1024;
    msgssz 8;
    msgtql 40;
    nbuf 0;
    ncallout "(64 + NPROC)";
    netmeminit 0;
    netmemmax "(1024 * NETCLBYTES)";
    netmemthresh 0;
    nfile "(16 * (NPROC + 16 + MAXUSERS) / 10 + 32 + 2 * NETSLOP)";
    nflocks 200;
    ninode 8192;
    nproc 512;
    npty 60;
    ntext 128;
    semaem 16384;
    semmap 10;
    semmni 10;
    semmns 60;
    semmnu 30;
    semume 10;
    semvmx 32767;
    shmmni 100;
    shmmax 0x4000000;
    shmseg 12;
    timeslice "(HZ/10)";
    timezone 300;
    unlockable_mem 0;
    server_node 1;
    num_cnodes 40;
    ngcsp "(8 * NUM_CNODES)";