[comp.unix.i386] problems with 2 drives under 386/ix 2.0.1, and a TCP/IP problem

mike@antel.uucp (Michael Borza) (11/15/89)

Hi folks,

I've got a couple of interesting 386/ix problems that I've not been
able to solve, so I thought I'd throw them out here to see if
anyone else has seen them, and if so, how they were solved.

1. I've got ISC 386/ix 2.0.1 running on the following hardware:

    25 MHz 386 clone with 387
    8 MB DRAM
    Adaptec ACB 2312 ST-506 MFM floppy/hard disk controller
    Miniscribe 6085
    Miniscribe 3053
    single 1.2 MB, 5-1/4" floppy disk drive
    BTC VGA graphics controller
    Logitech serial mouse on /dev/tty00
    modem on /dev/tty01

The /dev/dsk/0s? partitions reside on the 6085, while the
/dev/dsk/1s? partitions are on the 3053.  X-Windows and TCP/IP
are also installed, but I believe they're irrelevant to this
discussion.  The disk partitioning and filesystems are set up
approximately as follows:

    /dev/dsk/0s1     /       ~30 MB
    /dev/dsk/0s3     /usr    ~30 MB
    /dev/dsk/1s1     /tmp    ~5 MB
    /dev/dsk/1s3     /usr2   ~35 MB

/dev/swap also lives on the 6085.  The second disk was installed
using sysadm, and its partitions are automatically mounted at
boot-time.

Now the problem.  If I execute a normal shutdown with
/etc/shutdown, the next time I bring the system up, the first
execution of /etc/dfspace causes the system to hang completely,
not even echoing characters.  If I don't execute dfspace, the
system will apparently run indefinitely, and seemingly normally;
however, dfspace is not the only command which can cause the
hang.  `Pack' has also caused it on at least one occasion, and I
expect that other programs could do the same thing, although I
haven't found any others.  After a crash, once the system's back
up (after fsck'ing all of its filesystems), dfspace, pack, and
everything else run perfectly.

I have tried a number of things to get around this.  Manually
sync'ing and umounting the file systems before shutting down
doesn't improve the situation, nor does executing fsck on each
power-up.  I expect that what's happening is that some vital
information about the filesystems is not being updated during the
umount prior to shutdown.  This problem has not been observed
with just a single disk attached to the controller.  Anyone have
any ideas?  Right now, the safest way to shut the system down is
for me to sync the disk buffers and then power down; fsck puts
everything back in order after the boot, and the system runs
reliably.  I'm not too happy about doing that, though.
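
For the record, that workaround amounts to nothing fancier than the
following (the second sync is just habit, not anything ISC documents
as required; the point is to let disk activity settle before
switching off):

    sync        # flush the in-core disk buffers to both drives
    sync        # habit; wait for the drive activity to stop,
                # then power the machine off
    # at the next boot, fsck repairs the filesystems and everything
    # runs normally until the next shutdown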


The second problem involves TCP/IP.  I've got a second system
running 386/ix 1.0.6.  This system is a 20 MHz 386 with 387 and 10
MB DRAM.  Both systems run host-based TCP/IP, v1.1.2 on the 2.0.1
system and v1.0.3 on the 1.0.6 system.  Both systems are using
Western Digital WD8003E ethernet cards.  To perform backups, we
have a Wangtek tape controller driving an Archive FT-60E tape
drive attached to the 1.0.6 system.  To back up the 2.0.1 system,
I use the following command (executed from the 2.0.1 system):

   find . [...] | cpio -oc | \
     rsh node_name 'compress | dd ibs=1024k obs=1024k of=/dev/tape'

This frequently causes hangs on one or the other of the systems,
in which all system activity ceases (character echoing included).
I've played around with the number of dblocks, which changes how
early the hang occurs, but not whether it eventually occurs.
Sometimes I can back up two or three partitions with
no problems, but if I keep doing it long enough, I eventually get
a hang.

This is the current dblock configuration on the 2.0.1 system:

		 alloc	 inuse	   total     max    fail
dblock class:
    0 (   4)	   128	     0	  345380       3       0
    1 (  16)	   128	    30	   41906      32       0
    2 (  64)	   128	    17	  214582     115      26
    3 ( 128)	   128	   108	   25676     115    9001
    4 ( 256)	   128	     0	   11969       8       0
    5 ( 512)	   256	     0	    3820       3       0
    6 (1024)	    32	     0	    2315       4       0
    7 (2048)	    16	     0	    1906       5       0
    8 (4096)	     8	     0	      16       1       0

The configuration on the 1.0.6 system is similar.  I've
increased the number of buffers so that I see no failures
(that I've observed) in any class, but to no avail.  I've
also got 600 disk buffers allocated and 300 clists.
Significantly increasing either the number of clists or the
number of dblocks seems to hasten the onset of the crash,
contrary to my intuitive expectations.

Sorry I've gone on so long, but I wanted to give accurate
information about these problems.  Any pointers to solutions to
either of these problems will be greatly appreciated.

thanks in advance,
mike borza.
-- 
Michael Borza              Antel Optronics Inc.
(416)335-5507              3325B Mainway, Burlington, Ont., Canada  L7M 1A6
work: mike@antel.UUCP  or  uunet!utai!utgpu!maccs!antel!mike
home: mike@boopsy.UUCP  or  uunet!utai!utgpu!maccs!boopsy!mike

larry@focsys.UUCP (Larry Williamson) (11/23/89)

In article <1989Nov14.175913.7840@antel.uucp> mike@antel.uucp (Michael
Borza) wrote, describing a two-disk problem and a TCP/IP problem.

I can't help you with your first problem, the disk-related issues,
but I may be able to offer some insight into your TCP/IP woes.

Your TCP/IP problem description...

   This frequently causes hangs on one or the other of the systems,
   in which all system activity ceases (character echoing included).
   I've played around with the number of dblocks, which changes how
   early the hang occurs, but not ultimately whether it occurs at
   some time.

Sounds all too familiar. I've had that problem here for some time.
What I've learned so far is that you should set the number of dblocks
in each class (NBLK64, NBLK128, etc.) big enough that there are never
any failures.

Your list (edited)...

   		 alloc	 inuse	   total     max    fail
   dblock class:
       1 (  16)	   128	    30	   41906      32       0
       2 (  64)	   128	    17	  214582     115      26  <<<<***
       3 ( 128)	   128	   108	   25676     115    9001  <<<<***
       4 ( 256)	   128	     0	   11969       8       0

shows some failures. These are not good.

My strategy has been to double the allocation for any dblock class
that shows failures until I get no more failures. So in your case, I
would double NBLK64 and NBLK128 each to 256. If failures continue to
show up, increase them again. Also, watch for failures in the streams
and queues.
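
For what it's worth, here is roughly how we make the change. I don't
have the manuals beside me, so treat the file and command names below
as my best recollection rather than gospel:

    # add or edit the overrides in /etc/conf/cf.d/stune, e.g.
    #     NBLK64      256
    #     NBLK128     256
    # then relink the kernel and reboot on it
    /etc/conf/bin/idbuild

(kconfig should get you to the same place through menus if you would
rather not edit the file by hand.)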

Our two 2.0.2 systems are set up like this...

		 alloc	 inuse	   total     max    fail
streams:	    96	    40	     304      53       0
queues: 	   512	   216	    1702     294       0
mblocks: 	  3270	   735	 4625499     969       0
dblocks: 	  2616	   735	 3954806     969       0
dblock class:
    0 (   4)	   256	     1	  138298       7       0
    1 (  16)	   256	    26	  560779     130       0
    2 (  64)	  1024	   601	 3061270     819       0
    3 ( 128)	   512	   104	   47186     148       0
    4 ( 256)	   256	     0	   60700     123       0
    5 ( 512)	   128	     0	   26377      40       0
    6 (1024)	    64	     0	   23847      11       0
    7 (2048)	    64	     3	   36349       8       0
    8 (4096)	    56	     0	       0       0       0

dblock 64 seems to be our biggest headache. I see from this list
that it is getting close to overflowing again.

It has been two days since I last booted this machine.

What I have seen is that sometimes, if you wait long enough, the
system *will* come back to life.  If you are so lucky, and your
machine continues to breathe, then do a netstat -m; you will likely
see one of the dblock classes with a failure count in the hundreds of
thousands or possibly in the millions.  I believe the kernel is in a
loop trying to get the dblocks over and over again.
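
If it helps, a crude way to see the fail counts creeping up before a
hang is to leave something like this running (plain Bourne shell; the
log file name is only an example):

    # record the STREAMS allocation statistics every five minutes
    while :
    do
            date >> /tmp/streams.log
            netstat -m >> /tmp/streams.log
            sleep 300
    done

When the machine does lock up, the tail of the log at least tells you
which class went first.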

Having said all that, we still occasionally get these mysterious
hangs, but much less frequently now.

Also note that our load is different from yours. We have an RFS link
between the two 386/ix machines and up to 10 users rlogin'd at any
given time from MS-DOS and BSD 4.3 machines. I haven't brought NFS up
yet, so I don't know what effect it will have.

-larry

david@hcr.uucp (David Fiander) (11/25/89)

In article <LARRY.89Nov22130745@focsys.UUCP> larry@focsys.UUCP (Larry Williamson) writes:
>
>Your list (edited)...
>
>   		 alloc	 inuse	   total     max    fail
>   dblock class:
>       1 (  16)	   128	    30	   41906      32       0
>       2 (  64)	   128	    17	  214582     115      26  <<<<***
>       3 ( 128)	   128	   108	   25676     115    9001  <<<<***
>       4 ( 256)	   128	     0	   11969       8       0
>
>shows some failures. These are not good.

Failures, of themselves, are not bad, but at least the system no
longer panics when a dblock allocation request fails, as it does
on release 1.0.6.

>
>My strategy has been to double the allocation for any dblock class
>that shows failures until I get no more failures. So in your case, I
>would double NBLK64 and NBLK128 each to 256. If failures continue to
>show up, increase them again. Also, watch for failures in the streams
>and queues.
>
>Our two 2.0.2 systems are set up like this...
>
>		 alloc	 inuse	   total     max    fail
>streams:	    96	    40	     304      53       0
>queues: 	   512	   216	    1702     294       0
>mblocks: 	  3270	   735	 4625499     969       0
>dblocks: 	  2616	   735	 3954806     969       0
>dblock class:
>    0 (   4)	   256	     1	  138298       7       0
>    1 (  16)	   256	    26	  560779     130       0
>    2 (  64)	  1024	   601	 3061270     819       0
>    3 ( 128)	   512	   104	   47186     148       0
>    4 ( 256)	   256	     0	   60700     123       0
>    5 ( 512)	   128	     0	   26377      40       0
>    6 (1024)	    64	     0	   23847      11       0
>    7 (2048)	    64	     3	   36349       8       0
>    8 (4096)	    56	     0	       0       0       0
>
>dblock 64 seems to be our biggest headache. I see from this list
>that it is getting close to overflowing again.
>
>It has been two days since I last booted this machine.

The problem is that there is (at least one) memory leak in the
kernel.  Under certain fairly common circumstances, the kernel will
forget about a dblock that it has allocated but no longer needs and
just drop it on the floor.  We saw this problem in 1.0.6, and gave
Interactive a fix for it.  Ask them.

David J. Fiander,
Networking Group,
HCR Corporation.