[comp.unix.xenix] Stuck messages in queues with btrieve

story@can503.UUCP (Robert Story) (09/09/89)

We are having a problem with message queues under Xenix.

Software         - Large financial application written in C
Computer Systems - IBM PS2 Model 70-E21 and IBM PS2 Model 70-121, both with
                   IBM 120MB hard drive, IBM 60MB internal tape drive,
                   eight port Stallion serial board with 3 to 6 terminals,
                   serial printer and parallel printer.
Operating System - SCO Xenix System V 2.2.3
Other Software   - Btrieve Record Manager Version 4.10 (80286 version)
                 - Panel Plus Version 1.00c

The problem seems to arise under heavy load, with 3 to 6 users all running the
financial application and printing documents.  A process will send a message to
Btrieve, then set an alarm for 60 seconds and block in the msgrcv() call.  Under
heavy load, one or two of the processes will get the alarm signal.  Examining
the message queues with ipcs shows messages from/to Btrieve in the queues, but
attempts to read these messages with msgrcv() and the message type set to zero
show an empty queue.  A call to msgctl() with IPC_STAT reports messages in the
queue, but the pointers to the first and last messages are 0.  Subsequent
messaging to Btrieve carries on as normal.
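
For readers unfamiliar with the pattern, the following is a minimal sketch of
the "set an alarm, then block in msgrcv()" idiom and of the IPC_STAT check
described above.  It is not the application's code: the names demo_msg,
on_alarm, recv_with_timeout and queue_count are hypothetical, the message
layout is invented, and it assumes a modern POSIX system with sigaction()
(code from 1989 would more likely have used signal()).

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <unistd.h>

/* Hypothetical message layout; the real Btrieve interface is not shown. */
struct demo_msg {
    long mtype;
    char mtext[64];
};

/* Empty handler: its only job is to make msgrcv() return with EINTR. */
static void on_alarm(int sig) { (void)sig; }

/* Block in msgrcv() for at most `seconds`.  Returns the number of bytes
 * received, or -1 with errno == EINTR if the alarm fired first. */
long recv_with_timeout(int qid, struct demo_msg *out, unsigned seconds)
{
    struct sigaction sa, old;
    long n;
    int saved;

    assert(out != NULL);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;      /* no SA_RESTART, so the call is interrupted */
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, &old);

    alarm(seconds);
    n = msgrcv(qid, out, sizeof out->mtext, 0L, 0);  /* type 0: any message */
    saved = errno;
    alarm(0);                      /* cancel the alarm if we got a reply */
    sigaction(SIGALRM, &old, NULL);
    errno = saved;
    return n;
}

/* Number of messages the kernel believes are on the queue, or -1 on error. */
long queue_count(int qid)
{
    struct msqid_ds ds;

    if (msgctl(qid, IPC_STAT, &ds) == -1)
        return -1;
    return (long)ds.msg_qnum;
}
```

In these terms, the symptom is that queue_count() reports a nonzero count
while recv_with_timeout() with message type zero still times out.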

We do not believe that the problem is with the '286 version of Btrieve.  When
we went to '386 mode for our code, I changed the appropriate ints to shorts
in the code that interfaces with Btrieve.  It has worked this way since last
November.  However, the Btrieve people in Austin have just sent us the '386
version over the wire and we will be trying that on Monday.  We had a problem
in this area last week but tracked it down to a queue sizing problem, and have
now configured the message queues to be more than adequate.  Of course, this
problem is really messy because one cannot buy the source code for Btrieve,
so it is a large unknown black box.

Your thoughts would be appreciated.  Please e-mail me.  Thanks.
-- 
[ Robert Story    ..{!utzoo!censor,!uunet!zardoz!avcoint}!avcocan!story     ]
[ SnailMail : AFS 201 Queens Avenue London Ontario Canada N6A 1J1           ]
[        or : AFS 3349 Michelson Drive Irvine California USA 92715-1606     ]
[ Voice     : +1 519 672-4220 xtn 233                                       ]

story@can503.UUCP (Robert Story) (09/20/89)

In article <295@can503.UUCP> I wrote:
>The problem seems to arise under heavy load, with 3 to 6 users all running
>our financial application and printing documents.  A process will msg to
>btrieve and then set an alarm for 60 seconds and sit on the msgrcv call.
>With a large load one or two of the processes will get the alarm signal.
>Examination of the message queues with ipcs shows messages from/to Btrieve in
>the queues but attempts to read these messages with msgrcv() and message type
>set to zero show an empty queue.  A call to msgctl() with IPC_STAT reports 
>messages in the queue but the pointers to the first and last messages are 0.
>Subsequent messaging to btrieve carries on as normal.

We had a person from SCO on site for a week, and last Saturday we found the
problem.  IT IS a kernel bug.  If the kernel suffers a page fault while copying
to or from the user's data area, it puts the process to sleep.  In the meantime
another process that is also using the message queues can steam through and do
its thing.  When the original process wakes up, the queue pointers it was
working with will have been changed underneath it and, of course, weird things
begin to happen.  Sometimes the free list turned up on queue 1, or queue 0
turned up on the free list, which explains why ipcs thought that there were
messages when there weren't.  This problem has been fixed in the AT&T 3.1 code
and the SCO 3.2 code by using semaphores in the critical areas.
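
The race can be illustrated with a toy, single-threaded model.  This is
emphatically not the Xenix kernel code, which is not shown here.  Every name
(msgsys, hdr, msg_send, msg_reachable, OTHER) is hypothetical, the "page
fault" is simulated by letting a second sender run in the middle of the first
sender's copy, and a simple flag stands in for the semaphore.

```c
#include <assert.h>
#include <stddef.h>

#define NHDR  4     /* size of the toy header pool */
#define OTHER 99    /* owner id of the second, intruding sender */

struct hdr { struct hdr *next; int owner; };

struct msgsys {
    struct hdr pool[NHDR];
    struct hdr *free_list;        /* unused message headers */
    struct hdr *q_first, *q_last; /* the one message queue in this model */
    int q_count;                  /* what an IPC_STAT-style counter reports */
    int locked;                   /* stand-in for the kernel semaphore */
};

void msgsys_init(struct msgsys *s)
{
    int i;
    for (i = 0; i < NHDR; i++) {
        s->pool[i].owner = 0;
        s->pool[i].next = (i + 1 < NHDR) ? &s->pool[i + 1] : NULL;
    }
    s->free_list = &s->pool[0];
    s->q_first = s->q_last = NULL;
    s->q_count = 0;
    s->locked = 0;
}

static void enqueue(struct msgsys *s, struct hdr *h, int owner)
{
    h->owner = owner;
    h->next = NULL;
    if (s->q_last != NULL) s->q_last->next = h; else s->q_first = h;
    s->q_last = h;
    s->q_count++;
}

/* Send one message.  If `fault` is set, a second sender runs in the middle,
 * as if the copy from user space had page-faulted.  Returns 0 on success,
 * -1 if the lock is held (a real kernel would sleep), -2 if no header. */
int msg_send(struct msgsys *s, int owner, int use_lock, int fault)
{
    struct hdr *h, *rest;
    int intruder_waiting = 0;

    assert(s != NULL);
    if (use_lock) {
        if (s->locked) return -1;
        s->locked = 1;
    }
    h = s->free_list;             /* header chosen before the copy */
    if (h == NULL) { if (use_lock) s->locked = 0; return -2; }
    rest = h->next;
    if (fault && msg_send(s, OTHER, use_lock, 0) == -1)
        intruder_waiting = 1;     /* the lock kept the intruder out */
    s->free_list = rest;          /* stale if the intruder got in */
    enqueue(s, h, owner);
    if (use_lock) s->locked = 0;
    if (intruder_waiting)
        msg_send(s, OTHER, use_lock, 0);  /* runs after the release */
    return 0;
}

/* Count the distinct headers actually reachable from the queue head. */
int msg_reachable(const struct msgsys *s)
{
    int seen[NHDR] = {0}, n = 0;
    const struct hdr *h;
    for (h = s->q_first; h != NULL && !seen[h - s->pool]; h = h->next) {
        seen[h - s->pool] = 1;
        n++;
    }
    return n;
}
```

In this model, msg_send(&s, 1, 0, 1) on a fresh msgsys lets both senders grab
the same header, so q_count says 2 while msg_reachable() finds only 1.  With
use_lock set to 1, the second sender waits its turn and the two numbers agree.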

I hope this helps others.  It cost our company a lot of money to discover
this one.  This bug surfaced just before a major release, so things were
pretty tense here.  I had a good time, though.  It's not every day I get to
assist in debugging kernel code.  If anyone wants more details, please e-mail
me.
