mike@antel.uucp (Michael Borza) (11/15/89)
Hi folks,

I've got a couple of interesting 386/ix problems that I've not been
able to solve, so I thought I'd throw them out here to see if anyone
else has seen them, and if so, how they were solved.

1. I've got ISC 386/ix 2.0.1 running on the following hardware:

	25 MHz 386 clone with 387
	8 MB DRAM
	Adaptec ACB 2312 ST-506 MFM floppy/hard disk controller
	Miniscribe 6085
	Miniscribe 3053
	single 1.2 MB, 5-1/4" floppy disk drive
	BTC VGA graphics controller
	Logitech serial mouse on /dev/tty00
	modem on /dev/tty01

The /dev/dsk/0s? partitions reside on the 6085, while the
/dev/dsk/1s? partitions are on the 3053.  X-Windows and TCP/IP are
also installed, but I believe they're irrelevant to this discussion.
The disk partitioning and filesystems are set up approximately as
follows:

	/dev/dsk/0s1	/	~30 MB
	/dev/dsk/0s3	/usr	~30 MB
	/dev/dsk/1s1	/tmp	~5 MB
	/dev/dsk/1s3	/usr2	~35 MB

/dev/swap also lives on the 6085.  The second disk was installed
using sysadm, and its partitions are automatically mounted at
boot-time.

Now the problem.  If I execute a normal shutdown with /etc/shutdown,
the next time I bring the system up, the first execution of
/etc/dfspace causes the system to hang completely, not even echoing
characters.  If I don't execute dfspace, the system will apparently
run indefinitely, and seemingly normally; however, dfspace is not the
only command which can cause the hang.  `pack' has also caused it on
at least one occasion, and I expect that other programs could do the
same thing, although I haven't found any.  After a crash, once the
system's back up (after fsck'ing all of its filesystems), dfspace,
pack, and everything else run perfectly.

I have tried a number of things to get around this.  Manually
sync'ing and umounting the file systems before shutting down doesn't
improve the situation, nor does executing fsck on each power-up.  My
guess is that some vital information about the filesystems is not
being updated during the umount prior to shutdown.  This problem has
not been observed with just a single disk attached to the controller.
Anyone have any ideas?  Right now, the safest way to shut the system
down is for me to sync the disk buffers and then power down; fsck
puts everything back in order after the boot, and the system runs
reliably.  I'm not too happy about doing that, though.

2. The second problem involves TCP/IP.  I've got a second system
running 386/ix 1.0.6.  This system is a 20 MHz 386 with 387 and 10 MB
DRAM.  Both systems run host-based TCP/IP, v1.1.2 on the 2.0.1 system
and v1.0.3 on the 1.0.6 system.  Both systems use Western Digital
WD8003E ethernet cards.  To perform backups, we have a Wangtek tape
controller driving an Archive FT-60E tape drive attached to the 1.0.6
system.  To back up the 2.0.1 system, I use the following command
(executed from the 2.0.1 system):

	find . [...] | cpio -oc | \
		rsh node_name 'compress | dd ibs=1024k obs=1024k of=/dev/tape'

This frequently causes hangs on one or the other of the systems, in
which all system activity ceases (character echoing included).  I've
played around with the number of dblocks, which changes how early the
hang occurs, but not ultimately whether it occurs at some time.
Sometimes I can back up two or three partitions with no problems, but
if I keep doing it long enough, I eventually get a hang.
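For what it's worth, the next thing I plan to try -- only a sketch
here, with the find arguments elided as above, and I don't yet know
whether it avoids the hang -- is the same pipeline with much smaller
dd block sizes, so each write moves through the network buffers in
small pieces rather than megabyte-sized chunks:

	find . [...] | cpio -oc | \
		rsh node_name 'compress | dd ibs=16k obs=16k of=/dev/tape'

If the hangs are tied to buffer exhaustion, this shouldn't fix the
underlying problem, but it may change how soon it shows up.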
This is the current dblock configuration on the 2.0.1 system:

	            alloc inuse   total  max fail
	dblock class:
	 0 (   4)     128     0  345380    3    0
	 1 (  16)     128    30   41906   32    0
	 2 (  64)     128    17  214582  115   26
	 3 ( 128)     128   108   25676  115 9001
	 4 ( 256)     128     0   11969    8    0
	 5 ( 512)     256     0    3820    3    0
	 6 (1024)      32     0    2315    4    0
	 7 (2048)      16     0    1906    5    0
	 8 (4096)       8     0      16    1    0

The configuration on the 1.0.6 system is similar.  I've increased the
number of buffers to the point where I got no failures (that I've
observed) in any class, but to no avail.  I've also got 600 disk
buffers allocated, and 300 clists.  Significantly increasing either
the number of clists or the number of dblocks seems to hasten the
onset of the crash, contrary to my intuitive expectations.

Sorry I've gone on so long, but I wanted to give accurate information
about these problems.  Any pointers to solutions to either of these
problems will be greatly appreciated.

thanks in advance,
mike borza.
--
Michael Borza              Antel Optronics Inc.
(416)335-5507              3325B Mainway, Burlington, Ont., Canada  L7M 1A6
work: mike@antel.UUCP   or uunet!utai!utgpu!maccs!antel!mike
home: mike@boopsy.UUCP  or uunet!utai!utgpu!maccs!boopsy!mike
larry@focsys.UUCP (Larry Williamson) (11/23/89)
In article <1989Nov14.175913.7840@antel.uucp> mike@antel.uucp (Michael
Borza) wrote, describing a two-disk problem and a tcp/ip problem.
I can't help you with your first problem, the disk-related issues, but
I may be able to offer some insight into your tcp/ip woes.
Your tcp/ip problem description...
	This frequently causes hangs on one or the other of the systems,
	in which all system activity ceases (character echoing included).
	I've played around with the number of dblocks, which changes how
	early the hang occurs, but not ultimately whether it occurs at
	some time.
Sounds all too familiar. I've had that problem here for some time.
What I've learned so far is that you should set the number of dblocks
of each class (NBLK64, NBLK128, etc.) big enough that there are never
any failures.
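A quick way to scan for those failures -- just a one-line sketch that
assumes the fail count is the last field of each netstat -m line, as
it is in the listings below -- is:

	netstat -m | awk '$NF + 0 > 0'

Any line it prints is a class that has failed an allocation at least
once.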
Your list (edited)...
	            alloc inuse   total  max fail
	dblock class:
	 1 (  16)     128    30   41906   32    0
	 2 (  64)     128    17  214582  115   26  <<<<***
	 3 ( 128)     128   108   25676  115 9001  <<<<***
	 4 ( 256)     128     0   11969    8    0
shows some failures. These are not good.
My strategy has been to double the number of allocated dblocks in the
failing classes until I get no more failures. So in your case, I
would double NBLK64 and NBLK128 each to 256. If failures continue to
show up, increase them again. Also, watch for failures in the streams
and queues.
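Changing these means retuning and rebuilding the kernel. Here is a
sketch of the procedure as I understand it, assuming the AT&T-style
idtune/idbuild tools under /etc/conf -- your release may ship kconfig
or another front end instead, so check your documentation:

	# run as root
	/etc/conf/bin/idtune -f NBLK64 256	# -f replaces any existing value
	/etc/conf/bin/idtune -f NBLK128 256
	/etc/conf/bin/idbuild			# rebuild the kernel
	shutdown -y -g0 -i6			# reboot onto the new kernel

The new sizes should then show up in the alloc column of netstat -m.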
Our two 2.0.2 systems are set up like this...
	            alloc inuse   total  max fail
	streams:       96    40     304   53    0
	queues:       512   216    1702  294    0
	mblocks:     3270   735 4625499  969    0
	dblocks:     2616   735 3954806  969    0
	dblock class:
	 0 (   4)     256     1  138298    7    0
	 1 (  16)     256    26  560779  130    0
	 2 (  64)    1024   601 3061270  819    0
	 3 ( 128)     512   104   47186  148    0
	 4 ( 256)     256     0   60700  123    0
	 5 ( 512)     128     0   26377   40    0
	 6 (1024)      64     0   23847   11    0
	 7 (2048)      64     3   36349    8    0
	 8 (4096)      56     0       0    0    0
dblock 64 seems to be our biggest headache. I see from this list that
it is getting close to overflowing again.
It has been two days since I last booted this machine.
What I have seen is that sometimes, if you wait long enough, the
system *will* come back to life. If you are so lucky, and your
machine continues to breathe, then do a netstat -m; you will likely
see one of the dblock classes with a failure count in the hundreds of
thousands, or possibly in the millions. I believe the kernel is in a
loop, trying to get the dblocks over and over again.
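If you want a record of how the counts climb toward a hang, you can
leave a simple logger running from another terminal (just a sketch;
the log file and interval are arbitrary):

	while :
	do
		date >> /tmp/netstat.log
		netstat -m >> /tmp/netstat.log
		sleep 300	# sample every five minutes
	done

After a hang (or a recovery), the tail of the log shows which class
ran dry first.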
Having said all that, we still occasionally get these mysterious
hangs, but much less frequently now.
Also note that our load is different from yours. We have an RFS link
between the two 386/ix machines and up to 10 users rlogin'd in at any
given time from ms-dos and BSD4.3 machines. I haven't brought NFS up
yet, so I don't know what effect it will have.
-larry
david@hcr.uucp (David Fiander) (11/25/89)
In article <LARRY.89Nov22130745@focsys.UUCP> larry@focsys.UUCP (Larry Williamson) writes:
>
>Your list (edited)...
>
>	            alloc inuse   total  max fail
>	dblock class:
>	 1 (  16)     128    30   41906   32    0
>	 2 (  64)     128    17  214582  115   26  <<<<***
>	 3 ( 128)     128   108   25676  115 9001  <<<<***
>	 4 ( 256)     128     0   11969    8    0
>
>shows some failures. These are not good.

Failures are not, of themselves, bad; at least the system no longer
panics when a dblock allocation request fails, as it does on release
1.0.6.

>My strategy has been to double the number of allocated dblocks in the
>failing classes until I get no more failures. So in your case, I
>would double NBLK64 and NBLK128 each to 256. If failures continue to
>show up, increase them again. Also, watch for failures in the streams
>and queues.
>
>Our two 2.0.2 systems are set up like this...
>
>	            alloc inuse   total  max fail
>	streams:       96    40     304   53    0
>	queues:       512   216    1702  294    0
>	mblocks:     3270   735 4625499  969    0
>	dblocks:     2616   735 3954806  969    0
>	dblock class:
>	 0 (   4)     256     1  138298    7    0
>	 1 (  16)     256    26  560779  130    0
>	 2 (  64)    1024   601 3061270  819    0
>	 3 ( 128)     512   104   47186  148    0
>	 4 ( 256)     256     0   60700  123    0
>	 5 ( 512)     128     0   26377   40    0
>	 6 (1024)      64     0   23847   11    0
>	 7 (2048)      64     3   36349    8    0
>	 8 (4096)      56     0       0    0    0
>
>dblock 64 seems to be our biggest headache. I see from this list that
>it is getting close to overflowing again.
>
>It has been two days since I last booted this machine.

The problem is that there is (at least one) memory leak in the
kernel.  Under certain fairly common circumstances, the kernel will
forget about a dblock that it has allocated but no longer needs, and
just drop it on the floor.  We saw this problem in 1.0.6 and gave
Interactive a fix for it.  Ask them.

David J. Fiander, Networking Group, HCR Corporation.