[comp.unix.xenix] SCO Xenix System Hang

kessler%cons.utah.edu@wasatch.UUCP (Robert R. Kessler) (12/11/88)

We are having problems with SCO 386 Xenix and are looking for some help.

Here is the scenario:

Our customer is running on an IBM PS/2 Model 80 (20 MHz) with a Hostess
multiport board.  They run about 6 concurrent users.  We have
installed the latest version of Xenix (2.2.? -- I don't recall which
exactly).

Our applications are all written in RM/COBOL supported by Austec.

Our customer arrives in the morning around 6 am and starts using a
terminal or two.  By 9 am they are up to full strength running all 6
terminals.  When a user runs our software, they all login with the
same user id which starts executing our own user interface shell
(written in COBOL).  It emulates the user interface that we had on our
original system running on TI minis.  The user then selects the
program to run and our shell uses the COBOL CALL statement to call the
program.  I don't believe that it actually forks a process to do this,
though I might be wrong.  Typically, a user seldom logs all the way
out and just goes in and out of programs from our shell.  Some time
after 11:00 am, the displays on the various screens all start to slow
down.  Instead of blasting the fields to the screen, it chunks a line
at a time.  If someone can get to a terminal with the login prompt,
they may be able to log in as shutdown and reboot the system.  All
user programs can usually save their data (the applications are all
doing data base like operations, using the key-indexed files provided
by COBOL -- we don't use any external data base facility).  However,
they cannot exit back out of the programs.  They just hang.

If they are successful at rebooting the system, everyone comes back up
and it works just fine.  If not, then the whole system hangs.  The
hang is interesting.  If you have a terminal sitting at the regular sh
prompt, you can type carriage returns and the prompt is echoed.  If
you do any command (ps, shutdown, etc) then it just goes away and
doesn't respond.  You can still type on the terminal, characters are
echoed, but nothing happens.  You can also switch to a different
screen and bring up the system prompt.  However, if you try to type on
this screen, nothing is echoed.

It seems to be related to the amount of work that gets done.  If they
then go for another three hours, it will crap out again.  Our
customers are currently rebooting at 11, 2 and 5 before going home and
running their nightly backup and housekeeping programs.  It is extremely
inconvenient.  We have three other customers waiting for their systems,
but we don't want to ship them until we can get this problem fixed.

We have contacted SCO a couple of times and Austec, but haven't had
any resolution of the problem (there seems to be some finger pointing
between the two).

Another data point -- I tried to simulate the problem and wrote a
program to CALL a couple of programs and exit, etc.  I eventually did
get code that would always cause the system to hang in exactly the
same way (but it doesn't need to do any calls).  However, I tracked
down the problem and it is some kind of record/file locking problem.
The program that eventually causes it to hang essentially opens and
writes to a shared file.  It hangs randomly when one terminal opens
or closes the file and another writes to it.  We guarantee that they
don't write concurrently to the same records, but that still shouldn't
get us into a situation where the entire system hangs.  The resulting
hang acts just like the hang at our customer's site.  However, this hang can
happen in one minute or 3 hours.  It is entirely timing dependent, not
load dependent.  I believe that my program uncovers another bug, and
really isn't what our user is seeing (I tried rewriting the program so
the file isn't shared and installed it at the customer -- it didn't
help since we still have lots of shared files that are used in the
system).  Plus, the fact that the failure is timing dependent makes me
believe that it is a different problem, though the result is the same.
(BTW -- the buggy program was run on a COMPAQ 386/20 DeskPRO running
2.3.1 Xenix).
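
For reference, below is a minimal C sketch of the kind of shared-file
exercise my test program performs.  The file name, record size, and the
use of lockf() record locking are my own guesses for illustration; the
real program is RM/COBOL working on its indexed files, so the runtime's
actual system calls may well differ.  Run "sharedtest open" on one
terminal and "sharedtest write" on another.

/*
 * sharedtest.c -- illustrative sketch only, not the actual test program.
 * One invocation repeatedly opens and closes a shared file; the other
 * repeatedly locks the first record, writes it, and unlocks it.
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define RECSIZE 128L

int main(int argc, char *argv[])
{
    char rec[RECSIZE];
    int fd;

    if (argc < 2) {
        fprintf(stderr, "usage: %s open|write\n", argv[0]);
        return 1;
    }
    memset(rec, 'x', sizeof rec);

    for (;;) {
        fd = open("shared.dat", O_RDWR | O_CREAT, 0666);
        if (fd < 0) {
            perror("shared.dat");
            return 1;
        }
        if (strcmp(argv[1], "write") == 0) {
            /* lock the first record, write it, seek back, unlock */
            if (lockf(fd, F_LOCK, RECSIZE) < 0)
                perror("lockf");
            else {
                if (write(fd, rec, sizeof rec) < 0)
                    perror("write");
                lseek(fd, 0L, SEEK_SET);
                lockf(fd, F_ULOCK, RECSIZE);
            }
        }
        close(fd);      /* the "open" side just opens and closes */
    }
}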

Any help would be greatly appreciated.

Can I write some logging programs to write useful information to a
file that we could examine after a crash?  Is there some system
parameter that I could tweak to alleviate it?

Thanks.
Bob.

debra@alice.UUCP (Paul De Bra) (12/11/88)

In article <766@wasatch.UUCP> kessler%cons.utah.edu@wasatch.UUCP (Robert R. Kessler) writes:
>We are having problems with SCO 386 Xenix and are looking for some help.
>
>Here is the scenario:
>
> [scenario about a system slowly dying (or hanging) deleted]

This story sounds like a memory allocation problem.  It looks like the
program is slowly but surely allocating more and more memory. The system
then needs more and more time to free up memory for additional processes.

My suggestion: Let all 6 users start working and then perform a "ps -el"
about every 5 minutes and watch the size of the program. It may not even
take 3 hours to reach the problem as the system will now have the additional
problem of finding memory to run the ps as well.
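
Something like this minimal C program would do the logging (a two-line
shell loop around sleep and ps would work just as well; the log file
name and the five-minute interval are arbitrary choices):

/* psmon.c -- append a timestamped "ps -el" snapshot to a log file
 * every five minutes, so you can watch the process sizes grow.
 */
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    for (;;) {
        system("date >> /tmp/psmon.log");    /* mark each snapshot */
        system("ps -el >> /tmp/psmon.log");
        sleep(300);                          /* five minutes */
    }
}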

Paul.
-- 
------------------------------------------------------
|debra@research.att.com   | uunet!research!debra     |
------------------------------------------------------

jbayer@ispi.UUCP (Jonathan Bayer) (12/12/88)

In article <766@wasatch.UUCP>, kessler%cons.utah.edu@wasatch.UUCP (Robert R. Kessler) writes:
> We are having problems with SCO 386 Xenix and are looking for some help.
> 
> Here is the scenario:
> 
> Our customer is running on an IBM PS 2/80, 20 Mhz, with a Hostess
> multiport board.  They run about 6 concurrent users.  We have
> installed the latest version of Xenix (2.2.? -- I don't recall which
> exactly).
> 
> Our applications are all written in RM/COBOL supported by Austec.
> 
> Our customer arrives in the morning around 6 am and starts using a
> terminal or two.  By 9 am they are up to full strength running all 6
> terminals.  When a user runs our software, they all login with the
> same user id which starts executing our own user interface shell
   ^^^^^^^^^^^
This is a very bad idea.  We had a customer who did the same thing.  The
system would hang if too many people would log on using the same id.
Make sure they use  different ids.  Otherwise it seems that internal
system tables  are getting filled up.

Also, how much memory is in the system?  We figure that the base system
needs about 1 meg, and then we add 512k for each additional terminal.  It
seems to work.  As I understand it, RM/COBOL is a memory hog.


Jonathan Bayer
Intelligent Software Products, Inc.

bill@bilver.UUCP (bill vermillion) (12/13/88)

In article <766@wasatch.UUCP> kessler%cons.utah.edu@wasatch.UUCP (Robert R. Kessler) writes:
>We are having problems with SCO 386 Xenix and are looking for some help.
>
>Here is the scenario:
>
...... (deleted much relating to system slowdown.)....
>         .  When a user runs our software, they all login with the
>same user id which starts executing our own user interface shell

I have seen this occur on older Xenix systems - Tandy 6000 comes to mind.
System would grind to a halt.  The problem was that someone decided all users
should have the same login id.  When everyone used their own logins things
became tolerable again.  (As tolerable as a 6 MHz multiuser 68000 machine
can be.)

>hang is interesting.  If you have a terminal sitting at the regular sh
>prompt, you can type carriage returns and the prompt is echoed.  If
>you do any command (ps, shutdown, etc) then it just goes away and
>doesn't respond.  You can still type on the terminal, characters are
>echoed, but nothing happens.  You can also switch to a different
>screen and bring up the system prompt.  However, if you try to type on
>this screen, nothing is echoed.

Sounds like it might be getting "swapped to death" and/or running out of
processes.  It could still be related to "one user" logging in multiple times.

>Any help would be greatly appreciated.

Just a thought.  It might not be applicable in your case.

-- 
Bill Vermillion - UUCP: {uiucuxc,hoptoad,petsd}!peora!rtmvax!bilver!bill
                      : bill@bilver.UUCP

kessler%cons.utah.edu@wasatch.UUCP (Robert R. Kessler) (12/13/88)

Oops, this should have gone to the whole mailing list:
To: jbayer@ispi.UUCP
Subject: Re: SCO Xenix System Hang
Newsgroups: comp.unix.xenix
In-Reply-To: <347@ispi.UUCP>
References: <766@wasatch.UUCP>
Organization: University of Utah, Computer Science Dept.
Cc: 
Bcc: 

In article <347@ispi.UUCP> you write:
>In article <766@wasatch.UUCP>, kessler%cons.utah.edu@wasatch.UUCP (Robert R. Kessler) writes:
>> We are having problems with SCO 386 Xenix and are looking for some help.
>> 
[Lines deleted]
>> terminals.  When a user runs our software, they all login with the
>> same user id which starts executing our own user interface shell
>   ^^^^^^^^^^^
>This is a very bad idea.  We had a customer who did the same thing.  The
>system would hang if too many people would log on using the same id.

Did you see the same kind of hang that I outlined?  Where it just died
after a while, or did it hang when someone tried to login?

>Make sure they use  different ids.  Otherwise it seems that internal
>system tables  are getting filled up.

As an experiment, that would be easy to try.

>
>Also, how much memory is in the system?  We figure that the base system
>needs about 1 meg, and then we add 512k for each additional terminal.  It
>seems to work.   As I understand RM/Cobol is a memory hog.
>
That particular system has 4 Meg.  The COMPAQ on which I was testing
the COBOL program that hung with concurrent access has 5 Meg (and I
could hang it with only 2 users).
>
>Jonathan Bayer
>Intelligent Software Products, Inc.

By the way -- as suggested by Paul De Bra, I installed a script which
slept for 5 minutes and then dumped out a ps -el into a file.  It ran
all day waiting for the system to hang, and the system went 6
hours without a hang -- a new record for length of time under
full usage.  They had to reboot the system to leave for the evening,
so it never did hang.  We are rerunning the experiment today.  Boy,
what a strange coincidence.

Bob.

daveh@marob.MASA.COM (Dave Hammond) (12/15/88)

>In article <329@bilver.UUCP> bill@bilver.UUCP (bill vermillion) writes:
>We are having problems with SCO 386 Xenix and are looking for some help.
>[...]
>         .  When a user runs our software, they all login with the
>same user id which starts executing our own user interface shell
 ^^^^^^^^^^^^
 (problem lies here)

>hang is interesting.  If you have a terminal sitting at the regular sh
>prompt, you can type carriage returns and the prompt is echoed.  If
>you do any command (ps, shutdown, etc) then it just goes away and
>doesn't respond.  You can still type on the terminal, characters are
>echoed, but nothing happens.  You can also switch to a different
>screen and bring up the system prompt.  However, if you try to type on
>this screen, nothing is echoed.

You have exceeded the system-imposed open files limit (_NFILES in stdio.h)
for a single user-id.  The only solution is to assign individual user
accounts to each of your users.

The shell prompt will continue to be issued, as no files are opened by
simply pressing Return.  However, as soon as a valid (non-builtin) command
is issued, the shell must fork/exec the child program, which involves
opening more files (the program binary itself, at the very least) and
fails due to the afore-mentioned file table overflow.

BTW, unless there is a Real Good Reason for having multiple users with the
same user-id, it is far more appropriate for security, file management and
system administration purposes to assign individual user-ids.

--
Dave Hammond
...!uunet!masa.com!{marob,dsix2}!daveh

ruediger@ramz.UUCP (Ruediger Helsch) (12/16/88)

In article <329@bilver.UUCP> bill@bilver.UUCP (bill vermillion) writes:
>In article <766@wasatch.UUCP> kessler%cons.utah.edu@wasatch.UUCP (Robert R. Kessler) writes:
>>                  You can still type on the terminal, characters are
>>echoed, but nothing happens.  You can also switch to a different
>>screen and bring up the system prompt.  However, if you try to type on
>>this screen, nothing is echoed.
>
>Sounds like it might be getting "swapped to death" and/or running out of
>processes.  It could still be related to "one user" logging in multiple times.
>

Our system (Xenix release 2.2.1 on an AT) showed the same behaviour when I
once tried to run it out of memory (repeated forking or repeated malloc()ing).
malloc() or fork() should eventually return an error (after everything has
been swapped out, which is what I wanted to happen).  But they don't.  They
just seem to try forever.

It seems to me that requests for memory can't fail in Xenix.  Of course,
they also can't return if memory limits are exceeded.  They just hang there,
and the system does its best to find memory, swapping round and round.

Possibly the COBOL runtime system requests more and more memory, without
ever freeing it.  After some hours it exceeds the available memory and
swapping starts.
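
For what it's worth, here is a minimal sketch of the kind of memory
exhaustion test I mean; the 64K chunk size is an arbitrary choice.  If
requests for memory really cannot fail, this program will never print
its failure message -- it will just grind the machine into swapping --
so only try it on a system you can afford to reboot.

/* memhog.c -- allocate and touch memory until malloc() fails,
 * if it ever does.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK (64L * 1024L)

int main(void)
{
    long total = 0;
    char *p;

    for (;;) {
        p = malloc(CHUNK);
        if (p == NULL) {
            fprintf(stderr, "malloc failed after %ld bytes\n", total);
            return 1;
        }
        memset(p, 1, CHUNK);    /* touch it so real memory must be found */
        total += CHUNK;
        printf("%ld bytes\n", total);
    }
}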

dyer@spdcc.COM (Steve Dyer) (12/16/88)

In article <418@marob.MASA.COM> daveh@marob.masa.com (Dave Hammond) writes:
>>[description of a hung system]
>
>You have exceeded the system-imposed open files limit (_NFILES in stdio.h)
>for a single user-id.  The only solution is to assign individual user
>accounts to each of your users.  The shell prompt will continue to be issued,
>as no files are opened by simply pressing Return.  However, as soon as a
>valid (non-builtin) command is issued, the shell must fork/exec the child
>program, which involves opening more files (the program binary itself,
>at the very least) and fails due to the afore-mentioned file table overflow.

I'm afraid that you have things a little mixed up.

There is a global (per-system) limit on the maximum number of files which
may be open at once.  In most versions of UNIX, objects of type "struct file"
are allocated from a global array, "file[NFILE]", where NFILE may be a
compile-time manifest constant or a tunable parameter used at boot time
for memory allocation.  Attempting to exceed this limit generates a
kernel printf "file table overflow" on the console of most versions of UNIX,
and causes the offending system call to return -1 with an errno of ENFILE.

There is a per-process limit on the maximum number of open files, usually
referred to as NOFILES, and this ranges from 20-60 or more, depending on
your variant of UNIX.  Hopefully, _NFILE (from stdio.h) will agree with
NOFILES, though I have come across flavors where the kernel folks weren't
talking to the library folks.  An open/dup/creat call that would exceed
this limit returns -1 with errno set to EMFILE.

There is a limit on the maximum number of simultaneously-running processes
for a non-root user id, known as MAXUPRC, which is usually something like 25.
fork() returns -1 with errno set to EAGAIN if MAXUPRC is reached.
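
A small test program makes the distinction concrete: open a file
repeatedly until the call fails and look at errno, then fork until that
fails too.  This is only a sketch (the file name is a placeholder, and
it deliberately bangs into the limits, so run it from an otherwise idle,
non-root login):

/* limits.c -- probe the open-file and per-user process limits and
 * report which errno comes back when each one is hit.
 */
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    int n = 0;
    pid_t pid;

    while (open("/etc/passwd", O_RDONLY) >= 0)
        n++;
    printf("open() failed after %d extra descriptors: %s\n", n,
           errno == EMFILE ? "EMFILE (per-process NOFILES)" :
           errno == ENFILE ? "ENFILE (system file table)" : "other error");

    n = 0;
    while ((pid = fork()) > 0)
        n++;
    if (pid == 0) {              /* children just occupy process slots briefly */
        sleep(30);
        _exit(0);
    }
    printf("fork() failed after %d children: %s\n", n,
           errno == EAGAIN ? "EAGAIN (probably MAXUPRC)" : "other error");

    while (wait((int *)0) > 0)   /* collect the children as they exit */
        ;
    return 0;
}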

gordon@sneaky.TANDY.COM (Gordon Burditt) (12/19/88)

The described "hang": having the system run very slowly, but having
different users use different logins fixes or reduces the problem,
is caused by the per-user-id process limit.

There are several features that contribute to this:

1.  There is a limit to the number of processes a non-root user may
    have running at one time, called MAXUPRC.  The default is probably
    something like 15.  If you can re-link the kernel, you can probably 
    raise this limit.  (I am working from an old version (non-*86) of
    SCO Xenix System III, so some of this may have changed).  Raising this
    limit does not increase the size of any tables.  If you need lots of
    processes (and this problem will exist regardless of what uid's they
    run under), you may need to increase the number of entries for processes,
    open files, and inodes.  If you run out of open files or inodes, you
    get cryptic console messages like "no file".  If you run out of
    system processes, you may get error messages (csh), or just lots of
    retrying (sh).

    Fix:  don't run everything under the same uid, and/or raise MAXUPRC.

2.  If the "sh" shell gets an error on a "fork" due to running over the
    MAXUPRC limit, it retries.  Forever, unless interrupted by a signal.
    For a quick test of this, log in, then type sh<RETURN> repeatedly.
    After about 15 or 20 times, you won't get another prompt.  Use your
    interrupt character to unlock the terminal.  Then type lots of
    control-D's to get rid of all those shells.  

    Now, imagine three users logged in under the same user id.  Each
    has 5 processes, and is trying to create a 6th with sh.  None of them 
    will get any work done until one of them aborts the retries and
    terminates that shell.  Whether or not the interrupt character can
    do this, and whether trying will destroy data, depends on the application.

    The scenario this retrying is supposed to handle is waiting for another,
    independent job started from the same terminal to release its processes
    (say, a mailer doing background delivery) without requiring more.

    Fix:  Try to arrange jobs to not require deep nesting of processes.

3.  To further complicate the situation, some applications don't wait for
    their children.  A process occupies a process slot until its parent
    (or if its parent dies, its foster parent, process 1) waits for it.
    If an application keeps spinning off background print jobs, and never
    waits for them to finish, eventually it will hit MAXUPRC or run the
    system out of processes.  These will show up on a ps as zombie processes 
    with parents other than process 1.

    A compromise for this might be to allow one outstanding background
    job, and after spinning off the second one, wait for one of them
    to finish.  Also, doing a wait() with an alarm() set can pick up
    already-terminated processes without waiting for all of them (see the
    sketch following this list).

    Fix:  applications should wait for their children.  (Also check
    the status and report problems!)

4.  This one gets a little exotic, and may be specific to the system I
    am using.  It doesn't apply to systems that do paging instead of swapping.
    It also isn't related to running everything under one user id.
    There is a limit to the maximum amount of memory a process can use at 
    one time in the kernel.  Suppose that the kernel is re-configured to raise 
    this limit to above the amount of memory available.  (Limit > physical
    memory - kernel memory.  "Available memory" means the maximum amount of
    memory a process can get without hanging the system, after administrative
    restrictions are raised.) Now, have an application program request 110% of 
    available memory.  This request will fail.  Have an application request 
    100% of available memory plus one allocation unit.  This request doesn't 
    fail (but it should).  The process gets swapped out and tries to swap
    back in.  In the process, the swapper swaps everything else out.
    You can't kill the huge process because it needs to swap in to die.
    Something else running may lock up behind this process, or it may
    run, but slowly because it keeps getting swapped out.

    The fix for this is to not let processes get away with requesting that
    much memory.  The easiest way is to lower the "administrative limit"
    maxprocmem.  This may not be present in System V, or it may exist in
    another form.
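
To make points 2 and 3 concrete, here is a minimal sketch of a parent
that spins off a background job without either retrying fork() forever
or leaving zombies behind.  The command being run, the retry count, and
the timings are illustrative choices only, not anything from the
original application.  (Portability note: this relies on SIGALRM
interrupting wait(), which is how System III/V signal() behaves; on
systems where signal() restarts interrupted calls you would need
sigaction() without SA_RESTART instead.)

/* spawn.c -- start a background job, retrying fork() a bounded number
 * of times on EAGAIN, and reap finished children so they do not pile
 * up as zombies against MAXUPRC.
 */
#include <stdio.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static void catch_alarm(int sig) { (void)sig; }  /* just interrupt wait() */

/* reap any children that have already exited; give up after "secs" seconds */
static void reap(unsigned secs)
{
    signal(SIGALRM, catch_alarm);
    alarm(secs);
    while (wait((int *)0) > 0)   /* -1/EINTR when the alarm fires, or ECHILD */
        ;
    alarm(0);
}

static int spawn(char *const argv[])
{
    int tries;
    pid_t pid;

    for (tries = 0; tries < 5; tries++) {   /* bounded retry, not forever */
        pid = fork();
        if (pid == 0) {
            execvp(argv[0], argv);
            _exit(127);
        }
        if (pid > 0)
            return 0;
        if (errno != EAGAIN) {               /* a real failure: give up now */
            perror("fork");
            return -1;
        }
        sleep(2);                            /* back off, let a slot free up */
        reap(1);                             /* pick up any finished children */
    }
    fprintf(stderr, "spawn: no process slots after several tries\n");
    return -1;
}

int main(void)
{
    char *job[] = { "lp", "report.txt", (char *)0 };  /* illustrative job */

    if (spawn(job) == 0)
        reap(1);                 /* collect it (or earlier jobs) if done */
    return 0;
}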
					Gordon L. Burditt
					...!texbell!sneaky!gordon

domo@riddle.UUCP (Dominic Dunlop) (12/19/88)

In article <418@marob.MASA.COM> daveh@marob.masa.com (Dave Hammond) writes:
>>In article <329@bilver.UUCP> bill@bilver.UUCP (bill vermillion) writes:
>>[...]
>>         .  When a user runs our software, they all login with the
>>same user id which starts executing our own user interface shell
> ^^^^^^^^^^^^
> (problem lies here)
>...
>BTW, unless there is a Real Good Reason for having multiple users with the
>same user-id, it is far more appropriate for security, file management and
>system administration purposes to assign individual user-ids.

Absolutely.  Bill can probably achieve the effect he desires in terms of
file access permissions by putting all of the users of his application
into a single, dedicated group -- let's call it billsapp.  This needs a
line like

	billsapp::300:appuser1,appuser2,appuser3,appuser4

in /etc/group, together with 300 (or whatever group id he assigns) in the
fourth field of the /etc/passwd entry for each of those users.  For example

	appuser3:ecBxt/7Df8qlf:123:300:Bill's third user:/u/billsapp:/bin/sh

The SCO Xenix mkuser (user administration) tools could help Bill to set
things up in this way, except that, as I recall (can't run it up just now
-- the closest system's sufficiently secure that I don't know the root
password), it does not like assigning a new user to an existing home
directory, an action which is perfectly proper in a case such as this.
So Bill will probably have to do things by hand (SCO, please note).

This done, all files used by the application should be fixed to have write
permission for group.  Directories used for such things as scratch and
spool files need group write permission too.  Often, it's sufficient to do
this set-up by hand as part of installation, as many applications don't
need to _create_ shareable files when they run; they simply need to update
existing shareable files, and to create user-private scratch files.  Where
shareable files must be created -- for example, if one user generates a
print spooling file which can subsequently be reprinted by another user --
the application must create the file so that it can be shared.  The best
way to do this is to modify the application so that it explicitly gives a
suitable permissions value to the corresponding creat() or open() call.  To be
absolutely certain of getting a particular set of permissions, whatever the
value of the user's file creation permissions mask (umask), it's necessary
to call chmod() after file creation.  However, it's often sufficient to set
up a suitable umask in the .profile script run by /bin/sh when a user logs
into the system, provided that calls on open() and creat() in the
application specify permissive creation permissions -- typically 0666.
(Most programs, including BASIC, COBOL, and ``fourth-generation'' run-time
systems, do this).  For example

	umask 007			# read/write for user and group only
	PATH=/u/billsbin:$PATH export PATH
	cd /u/billsapp/billsdata
	exec billsapp

in /u/billsapp/.profile will throw any user whose home directory is
/u/billsapp (and whose login shell is /bin/sh) straight into the
application after setting up a suitable file creation mask, executable
search path, and current directory.  My company uses a procedure very like
this to run up an accounting package which we bought in binary form (and so
couldn't modify), and which was very insecure straight out of the box.
(All the permissions were wide-open, so anybody could do anything, whether
they were an accounts person or not.)
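
Where you can modify the application and want to be certain of the
permissions regardless of the user's umask, the create-then-chmod()
sequence looks like this (the path and mode are illustrative only):

/* mkshared.c -- create a file the whole application group can update,
 * whatever the umask happens to be.
 */
#include <stdio.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

int create_shared(const char *path)
{
    int fd;

    /* ask for read/write for user and group; umask may still strip bits */
    fd = open(path, O_RDWR | O_CREAT, 0660);
    if (fd < 0) {
        perror(path);
        return -1;
    }
    /* so force the mode explicitly after creation */
    if (chmod(path, 0660) < 0) {
        perror("chmod");
        close(fd);
        return -1;
    }
    return fd;
}

int main(void)
{
    int fd = create_shared("/u/billsapp/billsdata/spool.tmp");

    if (fd >= 0)
        close(fd);
    return fd >= 0 ? 0 : 1;
}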

The method I've outlined is probably sufficient for most applications.
However, there are times when more sophistication is required -- for
example, the case where an application needs to write a file which can be
accessed only by the application, not even by UNIX shell commands run by
the user who created the file.  In such cases, you need to get involved
with set-group-id programs.  I'd counsel against doing this unless you
really need to, as

  a) They're surprisingly difficult to get right (AT&T's mail gets it
     right; AT&T's lp gets it wrong); and

  b) It's a good way to write applications which are not easily portable
     between AT&T and Berkeley variants of UNIX (Xenix is in the AT&T camp
     on this issue).

To read further... _UNIX System Security_ by Wood & Kochan (Hayden, 1985,
ISBN 0-8104-6267-2) is a good place to start.

Hope this helps.
-- 
Dominic Dunlop
domo@sphinx.co.uk  domo@riddle.uucp

chip@vector.UUCP (Chip Rosenthal) (12/20/88)

How about checking the simple stuff first?  Is your 386 an old 15MHz CPU?
If so, you might be hitting the infamous "Errata 20" bug, in which case
SCO has a fix disk available.
-- 
Chip Rosenthal     chip@vector.UUCP    |      Choke me in the shallow water
Dallas Semiconductor   214-450-5337    |         before I get too deep.

steven@lakesys.UUCP (Steven Goodman) (12/23/88)

In article <671@vector.UUCP> chip@vector.UUCP (Chip Rosenthal) writes:
>How about checking the simple stuff first?  Is your 386 an old 15MHz CPU?
>If so, you might be hitting the infamous "Errata 20" bug, in which case
>SCO has a fix disk available.
>-- 

	If this bug appears and you happen to have a Compaq, contact your
dealer.  Compaq will (free of charge) add a small board which plugs into
your 386 socket and fixes the "errata 20" bug.

Not to repeat what might have already been said on this subject, but this
bug only occurs when a 387 chip is in use, and you can boot your system
to ignore the 387 until you receive some kind of fix.  From what I
understand, this is NOT fixable via software.


-- 
Steven M. Goodman
Lake Systems  -  Milwaukee, Wisconsin		uunet!marque!lakesys!steven
						uwvax!uwmcsd1!lakesys!steven