[comp.unix.xenix] 2.3.1 text corruption

carlson@lll-winken.LLNL.GOV (Joe Carlson) (06/04/89)

        I am experiencing a rather odd corruption problem under XENIX 386
2.3.1.  Basically, the in-core version of certain heavily
used programs appears to get corrupted every once in a while. I believe that
I have eliminated hardware trouble as the cause of this.
        The steps I took to isolate this were:
        1.) /bin/sh starts to core dump when run
        2.) mv /bin/sh /bin/sh.old
        3.) cp /bin/sh.old /bin/sh
        4.) /bin/sh now works correctly (/bin/sh.old still core dumps)
        5.) cmp -l /bin/sh /bin/sh.old shows no diffs
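
The copy-and-compare part of the steps above can be sketched as a small
shell function that works on a scratch copy instead of renaming the live
binary. This is only an illustration: check_disk_copy is a hypothetical
helper name, and the scratch file location is an assumption. It can only
confirm that the bytes on disk are readable and identical; it says nothing
about what the kernel is holding in core.

```shell
#!/bin/sh
# Hypothetical helper (not part of XENIX): copy a program to a scratch
# file and compare the two disk images byte for byte with cmp.
check_disk_copy() {
    prog=$1
    copy=${TMPDIR:-/tmp}/copycheck.$$      # assumed scratch location

    # Make a fresh disk copy, as in step 3 of the list above.
    cp "$prog" "$copy" || return 2

    # cmp -s is silent and reports only through its exit status.
    if cmp -s "$prog" "$copy"; then
        echo "identical on disk: $prog"
        rc=0
    else
        echo "disk copies differ: $prog"
        rc=1
    fi

    rm -f "$copy"
    return $rc
}
```

Calling check_disk_copy /bin/sh while /bin/sh is core dumping would show
whether the disk image itself still reads back cleanly. If it reports the
copies as identical, the file on disk is probably not what is damaged.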

	It appears that there is a text cache kept in the kernel which
has corrupted its entry for the program in question.... Rebooting the system
clears the trouble (or making another disk copy as I show above).  The problem
appears VERY sporadically and has hit a number of programs on the system
(/bin/sh and /etc/nntpd[one of my own]) that get run a lot.

Joe Carlson	carlson@lll-winken.llnl.gov	(415)-422-5584
 

sl@unifax.UUCP (Stuart Lynne) (06/04/89)

In article <26353@lll-winken.LLNL.GOV> carlson@lll-winken.LLNL.GOV (Joe Carlson) writes:
}2.3.1.  Basically, the in-core version of certain heavily
}used programs appears to get corrupted every once in a while. I believe that
}I have eliminated hardware trouble as the cause of this.

}	It appears that there is a text cache kept in the kernel which
}has corrupted its entry for the program in question.... Rebooting the system
}clears the trouble (or making another disk copy as I show above).  The problem
}appears VERY sporadically and has hit a number of programs on the system
}(/bin/sh and /etc/nntpd[one of my own]) that get run a lot.

I have also seen this problem on another system with a flakey swap area.
You might want to check that you don't have any bad blocks in your swap
area.
-- 
Stuart.Lynne@wimsey.bc.ca uunet!van-bc!sl 604-937-7532(voice) 604-939-4768(fax)

karl@ddsw1.MCS.COM (Karl Denninger) (06/05/89)

In article <26353@lll-winken.LLNL.GOV> carlson@lll-winken.LLNL.GOV (Joe Carlson) writes:
>
>        I am experiencing a rather odd corruption problem under XENIX 386
>2.3.1.  Basically, the in-core version of certain heavily
>used programs appears to get corrupted every once in a while. I believe that
>I have eliminated hardware trouble as the cause of this.

I wouldn't be so sure.

This showed up last night after adding a second controller and disk drive to
our system here (2.3.2).  The symptoms were 'cc' core dumping with a
segmentation violation, weird unreproducible problems with the tape drive,
and even a kernel panic (TRAP in system mode!).

I am going to try moving boards around and see if it goes away.  We have
never seen this one before; it's brand new and just started happening after
we added the secondary controller.  It may be related to board position; we
had to move things around to get the second board in the box, and the system
is stuffed now.  I will be moving the tape drive controller away from the
disk controllers, and in general trying a few things over the next few days.

>	It appears that there is a text cache kept in the kernel which
>has corrupted its entry for the program in question.... Rebooting the system
>clears the trouble (or making another disk copy as I show above).  The problem
>appears VERY sporadically and has hit a number of programs on the system
>(/bin/sh and /etc/nntpd[one of my own]) that get run a lot.

I've seen this happen with the tape drive too; it seems tied to heavy load
and secondary controller activity.  It's not yet been isolated here, but I
will report to the net when we have something more concrete.

--
Karl Denninger (karl@ddsw1.MCS.COM, <well-connected>!ddsw1!karl)
Public Access Data Line: [+1 312 566-8911], Voice: [+1 312 566-8910]
Macro Computer Solutions, Inc.		"Quality Solutions at a Fair Price"

wgb@tntdev.tnt.COM (William G. Bunton) (06/06/89)

In article <3571@ddsw1.MCS.COM> karl@ddsw1.MCS.COM (Karl Denninger) writes:

> In article <26353@lll-winken.LLNL.GOV> carlson@lll-winken.LLNL.GOV (Joe Carlson) writes:
> >
> >        I am experiencing a rather odd corruption problem under XENIX 386
> >2.3.1.  Basically, the in-core version of certain heavily
> >used programs appears to get corrupted every once in a while. I believe that
> >I have eliminated hardware trouble as the cause of this.
> 
> I wouldn't be so sure.

I wouldn't be either. I saw some really strange things happening on a
Motorola machine. I would be editing in MicroEmacs, and would suddenly
find myself in vi instead. Turned out the system had a bad cache
controller board. So I would check your cache controller if you have
cache memory.

Bill
--
William G. Bunton            wgb@tntdev.tnt.com     {uunet,natinst}!tntdev!wgb
Tools & Techniques, Inc. Austin, TX        

stu@jpusa1.UUCP (Stu Heiss) (06/06/89)

In article <133@unifax.UUCP> sl@unifax.UUCP (Stuart Lynne) writes:
-In article <26353@lll-winken.LLNL.GOV> carlson@lll-winken.LLNL.GOV (Joe Carlson) writes:
-}2.3.1.  Basically, the in-core version of certain heavily
-}used programs appears to get corrupted every once in a while. I believe that
-}I have eliminated hardware trouble as the cause of this.
-
-}	It appears that there is a text cache kept in the kernel which
-}has corrupted its entry for the program in question.... Rebooting the system
-}clears the trouble (or making another disk copy as I show above).  The problem
-}appears VERY sporadically and has hit a number of programs on the system
-}(/bin/sh and /etc/nntpd[one of my own]) that get run a lot.
-
-I have also seen this problem on another system with a flakey swap area.
-You might want to check that you don't have any bad blocks in your swap
-area.

I have also observed this but never considered the possibility of a disk
problem.  I do recall some discussion about bad track remapping not
working for the swap area.  Is this related or does anyone from sco have
any further info?
-- 
Stu Heiss - gargoyle.uchicago.edu!jpusa1.uucp!stu, stu@jpusa1.chi.il.us

karl@ddsw1.MCS.COM (Karl Denninger) (06/08/89)

In article <1124@jpusa1.UUCP> stu@jpusa1.chi.il.us (Stu Heiss,6312,6334,) writes:
>In article <133@unifax.UUCP> sl@unifax.UUCP (Stuart Lynne) writes:
>-In article <26353@lll-winken.LLNL.GOV> carlson@lll-winken.LLNL.GOV (Joe Carlson) writes:
>-}2.3.1.  Basically, the in-core version of certain heavily
>-}used programs appears to get corrupted every once in a while. I believe that
>-}I have eliminated hardware trouble as the cause of this.
>-
>-I have also seen this problem on another system with a flakey swap area.
>-You might want to check that you don't have any bad blocks in your swap
>-area.
>
>I have also observed this but never considered the possibility of a disk
>problem.  I do recall some discussion about bad track remapping not
>working for the swap area.  Is this related or does anyone from sco have
>any further info?

I have checked into this, and it's not the problem.  If it was, I would
expect to see a disk error message preceding the problems -- that has
never occurred here.

We saw the problem too, but worse.  Not only would I get weird crashes from
some programs, but also TRAP IN SYSTEM MODE panics!  Moving around a couple
of boards seems to have fixed it.  If you have halfway flakey hardware, watch
out -- you'll get all kinds of weird problems, none of which your POST or
diags will catch!

I believe that the tape controller was interfering with the disk controller
-- since moving the tape controller to a slot away from the drive
controllers we haven't seen the problem recur....

Check your hardware -- carefully.  I'll keep the net posted if the gremlins
come back to 'ddsw1'..... So far we're two days and counting without a
problem under heavy load.  

All this started here when I added a second controller and third fixed disk,
and put the controller too close to the tape controller board (an Archive
controller... guess it's noisy or something).

The problem that appeared to be SCO not remapping bad sectors in the swap
area turned out to be a SECOND bad sector in the swap area!  We mapped that
one out too, and now all is ok in that regard -- no more fixed disk errors.

Btw: The second controller support works beautifully, and the system appears
     to multithread I/O requests with two boards in there (ie: both disk
     access lights are on at the same time!!)  Nice job SCO!

--
Karl Denninger (karl@ddsw1.MCS.COM, <well-connected>!ddsw1!karl)
Public Access Data Line: [+1 312 566-8911], Voice: [+1 312 566-8910]
Macro Computer Solutions, Inc.		"Quality Solutions at a Fair Price"

romwa@gpu.utcs.utoronto.ca (Royal Ontario Museum) (06/12/89)

In article <3576@ddsw1.MCS.COM> karl@ddsw1.MCS.COM (Karl Denninger) writes:
>In article <1124@jpusa1.UUCP> stu@jpusa1.chi.il.us (Stu Heiss,6312,6334,) writes:
>>In article <133@unifax.UUCP> sl@unifax.UUCP (Stuart Lynne) writes:
>>-In article <26353@lll-winken.LLNL.GOV> carlson@lll-winken.LLNL.GOV (Joe Carlson) writes:
>>-}2.3.1.  Basically, the in-core version of certain heavily
>>-}used programs appears to get corrupted every once in a while. I believe that
>>-}I have eliminated hardware trouble as the cause of this.
>>-
>>-I have also seen this problem on another system with a flakey swap area.
>>-You might want to check that you don't have any bad blocks in your swap
>>-area.
>>
>>I have also observed this but never considered the possibility of a disk
>>problem.  I do recall some discussion about bad track remapping not
>>working for the swap area.  Is this related or does anyone from sco have
>>any further info?
>
>I have checked into this, and it's not the problem.  If it was, I would
>expect to see a disk error message preceding the problems -- that has
>never occurred here.
>
>We saw the problem too, but worse.  Not only would I get weird crashes from
>some programs, but also TRAP IN SYSTEM MODE panics!  Moving around a couple
>of boards seems to have fixed it.  If you have halfway flakey hardware, watch
>out -- you'll get all kinds of weird problems, none of which your POST or
>diags will catch!
>
>I believe that the tape controller was interfering with the disk controller

I'm not sure if this is related to the above problem, but I,
too, have seen some weird things happen with 2.3.1.  In
particular, twice now I have had parts of disk files show up
in mail messages.  The first case was a record from a Foxbase
file and the second was some text information.  

I'll keep my eye on the hardware.


Mark T. Dornfeld
Royal Ontario Museum
100 Queens Park
Toronto, Ontario, CANADA
M5S 2C6

mark@utgpu!rom      - or -     romwa@utgpu