MDelany%hbapn1.prime.com@relay.cs.net (Mark Delany) (05/08/90)
Has anyone else seen Message Queue corruption on XENIX 386 (SysV 2.3.1)
when under heavy load, particularly when the system is paging?

We're suspicious as ipcs gives strange values for CBYTES and QNUM. To wit:

--------------------
Standard IPC package status
IPC status from /dev/kmem as of Thu May 3 11:01:02 1990
T  ID  KEY         MODE         OWNER  GROUP  CREATOR  CGROUP  CBYTES  QNUM   QBYTES  LSPID  LRPID  STIME     RTIME     CTIME
Message Queues:
q  10  0x712806a1  SRrw-rw-rw-  cacs   group  cacs     group   65404   65535  1028    389    331    10:50:53  10:50:53  10:29:42
q  11  0x712806a2  -Rrw-rw-rw-  cacs   group  cacs     group   40      2      8192    331    389    10:50:53  10:50:53  10:29:42
...
--------------------

and on another occasion

--------------------
Standard IPC package status
IPC status from /dev/kmem as of Thu May 3 16:12:47 1990
T  ID  KEY         MODE         OWNER  GROUP  CREATOR  CGROUP  CBYTES  QNUM   QBYTES  LSPID  LRPID  STIME     RTIME     CTIME
Message Queues:
q  20  0x712806a1  SRrw-rw-rw-  cacs   group  cacs     group   65445   0      1028    511    446    16:05:24  16:05:24  15:34:09
q  21  0x712806a2  -Rrw-rw-rw-  cacs   group  cacs     group   20      1      8192    446    511    16:05:24  16:05:24  15:34:09
...
--------------------

CBYTES and QNUM are 16 bit so it looks pretty much like an underflow
problem to me...

It only seems to occur when the system is heavily loaded and most likely
paging too. Further, the programs in question are making fairly extensive
use of Message Q's (as well as shared memory - if that's relevant) and it
is highly likely that more than one process is trying to access the same Q
at the same time. In other words, if there are any flaws in the locks
protecting these structures, then the progs will find them real soon!

Once this corruption occurs, all the programs wedge on message Qs. In
addition, the system often hangs after this has happened. The only
solution we've found so far is to re-boot :-(

What I'd like to know is:

Has anyone else come across this?
Were you able to effect a work-around?

Naturally I've already called our supplier for help, but they're an indirect supplier (i.e. not SCO) and, er, haven't been able to come up with any solution or work-around for us thus far. Thanks.
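
For anyone wanting to check the underflow theory against the numbers, here is a rough sketch of the arithmetic (in Python purely for convenience). It reinterprets the CBYTES and QNUM values from the two ipcs listings above as signed 16-bit quantities, and flags any queue whose CBYTES exceeds its QBYTES limit. The helper names `as_signed16` and `looks_corrupt` are made up for illustration, and the CBYTES <= QBYTES check assumes the usual System V behaviour where a send is not allowed to push a queue past its byte limit.

```python
def as_signed16(value):
    """Reinterpret an unsigned 16-bit counter as signed two's complement."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def looks_corrupt(cbytes, qbytes):
    """Assumed sanity check: bytes on the queue should never exceed its limit."""
    return cbytes > qbytes


# (queue id, CBYTES, QNUM, QBYTES) copied from the two ipcs listings
observed = [
    (10, 65404, 65535, 1028),
    (11, 40, 2, 8192),
    (20, 65445, 0, 1028),
    (21, 20, 1, 8192),
]

for qid, cbytes, qnum, qbytes in observed:
    flag = "SUSPECT" if looks_corrupt(cbytes, qbytes) else "ok"
    print(f"q {qid}: CBYTES {cbytes} ({as_signed16(cbytes)} signed), "
          f"QNUM {qnum} ({as_signed16(qnum)} signed), "
          f"QBYTES {qbytes} -> {flag}")
```

Read as signed values, queue 10 shows CBYTES of -132 and QNUM of -1, and queue 20 shows CBYTES of -91, which is at least consistent with the counters having been decremented below zero. It does not by itself say where the extra decrement comes from.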