NU013809@NDSUVM1.BITNET (Greg Wettstein) (03/14/91)
Our group is being plagued by segmentation faults (Signal 11) and I am wondering whether anyone else has experienced similar problems. I originally attributed the problem to faulty hardware but I am now beginning to entertain other causes.

The original problem was experienced on an ALR 386/220 with 6 Mbyte of memory, and I am convinced that this was due in large part to faulty memory. Every ALR machine we have touched in this group has had memory problems which appear to be related to DMA. Even motherboard replacements which were 'guaranteed to fix the problem' failed to stop them. Besides generating segmentation faults, these problems would occasionally bring the machine to an instant, devastating halt. At this point we were running XENIX 2.3.2 with the UFJ update for VPIX, yielding a kernel release of 2.3.3.

The problem got so bad that it became necessary to change out the hardware. The machine we opted for was a Gateway 2000 running at 33 MHz with 8 Mbyte of memory and 64 kbyte of cache. The disk system is a CompuAdd caching ESDI disk controller (2 Mbyte) connected to a 350 Mbyte Wren. Also in the machine are a Mountain tape card, a Multitech 224EC internal modem and a 12 port Equinox Megaport card.

The drive and controller came from the old machine. When we cranked up the system it made it through the boot sequence but crashed with a memory error when I tried to log in. The first thing that went through my mind was that the memory board was probably loose. The Gateway 2000 is based on a Micronics motherboard which has all its memory sitting on a card which plugs into a special 32 bit slot on the motherboard. I pulled the memory board, reseated it and started up again. This time I logged in but experienced numerous segmentation faults and two crashes as I tested the machine for the rest of the afternoon. It seemed to stabilize the next day (Friday) but when I came in on Monday morning it was sitting dead with a kernel panic on the screen.
We have now been using the machine for a month and, while performance is excellent, we are still experiencing enough segmentation faults and the occasional panic that we cannot put the machine in production. Yesterday I installed the xnx155b upgrade thinking that there may be a fix embedded somewhere in there that would solve my problems. I had noted that a couple of the SLS upgrades available in the sco-archive directory on uunet mentioned that they corrected problems when XENIX was run on various types of motherboards. The machine ran through the night but this morning when I fired up emacs it persistently gave me segmentation faults. I ran shutdown from root and rebooted (without cycling power) and things were fine.

I am presently in a quandary whether to call in the technical support people and claim memory problems or look for OS problems. I have written C programs which malloc large blocks (>1 Mbyte) of memory and fill/refill these blocks in various combinations, realloc'ing etc., trying to flush out memory faults, but I cannot seem to consistently force failures. I have spawned several of these until the machine was forced into severe swapping, to the point where the swap area was completely filled, and they generated no panics. Yet a little while later I will be halfway through a large set of compilations and gcc will dump core after catching a segmentation fault signal.

I am getting ready to pull the Equinox card and its drivers out to see whether or not they could be the root of the problem. This card has performed flawlessly so I am not very quick to point a finger at it. I re-jumpered the motherboard to disable caching of the memory region into which the card maps its buffer and control blocks, so that should not be a problem. The only reason I suspect an interrupt problem is that when uusched kicks up uucico to poll one of the neighboring sites, uucico will fail and dump core, presumably due to a segmentation fault.
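For anyone who wants to try the same kind of test, here is a minimal sketch of a malloc/fill/verify/realloc exerciser along the lines described above. It is an illustration only, not the actual programs that were run: the function names (fill_and_verify, exercise), the XOR fill pattern, the seed values and the growth step are all assumptions made for the example.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: fill n bytes with a pattern derived from seed,
 * then read every byte back. Returns the number of mismatches.
 * The volatile pointer keeps the compiler from optimizing the
 * read-back away; it does not bypass any hardware cache. */
static unsigned long fill_and_verify(unsigned char *buf, size_t n,
                                     unsigned char seed)
{
    volatile unsigned char *p = buf;
    unsigned long bad = 0;
    size_t i;

    for (i = 0; i < n; i++)
        p[i] = (unsigned char)(seed ^ (i & 0xff));
    for (i = 0; i < n; i++)
        if (p[i] != (unsigned char)(seed ^ (i & 0xff)))
            bad++;
    return bad;
}

/* Hypothetical driver: allocate block_bytes, then for each pass fill
 * and verify the block and grow it by half the starting size with
 * realloc. Returns the total mismatch count, or -1 if an allocation
 * fails. */
static long exercise(size_t block_bytes, int passes)
{
    /* Assumed seeds: all zeros, all ones, alternating bits. */
    static const unsigned char seeds[] = { 0x00, 0xff, 0x55, 0xaa };
    size_t size = block_bytes;
    unsigned char *buf = malloc(size);
    unsigned char *tmp;
    long bad = 0;
    int pass;

    if (buf == NULL)
        return -1;

    for (pass = 0; pass < passes; pass++) {
        bad += (long)fill_and_verify(buf, size, seeds[pass % 4]);

        /* Grow the block so the allocator has to extend or move it. */
        size += block_bytes / 2;
        tmp = realloc(buf, size);
        if (tmp == NULL) {
            free(buf);
            return -1;
        }
        buf = tmp;
    }

    /* One last pass over the fully grown block. */
    bad += (long)fill_and_verify(buf, size, 0x33);
    free(buf);
    return bad;
}
```

A main() that calls something like exercise(1024UL * 1024UL + 4096UL, 4) and prints the result would make this a standalone program; running several copies at once pushes the machine into swap. A nonzero count would point at memory, but a clean run proves little, since a test like this touches memory in a very regular order and may never reproduce the access pattern that triggers the fault.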
Occasionally when a neighboring site calls in to poll us a similar event will happen. I should mention that the modem in question is on a serial port, not one of the Equinox ports. But when one is chasing ghosts all corners should get investigated...

I would be interested in whatever commentary the net is willing to offer. There are bunches of XENIX sites out there so I am hoping that somebody may have experienced this problem. If there are no experiences with this type of phenomenon then I have to turn the heat up on Gateway. My boss keeps telling me, "But they've burned these machines in, how can we have any problems....". 'Tiz a protected mode operating system, my friend....

Any information would, as always, be deeply appreciated.

As always,
Dr. G.W. Wettstein
Oncology Research Division Computing Facility
Fargo Clinic / MeritCare
UUCP: uunet!plains!wind!greg
INTERNET: greg%wind.uucp@plains.nodak.edu
Phone: 701-234-2833

`The truest mark of a man's wisdom is his ability to listen to other men expound their wisdom.'