kessler%cons.utah.edu@wasatch.UUCP (Robert R. Kessler) (12/11/88)
We are having problems with SCO 386 Xenix and are looking for some help.

Here is the scenario: Our customer is running on an IBM PS/2 Model 80, 20 MHz, with a Hostess multiport board. They run about 6 concurrent users. We have installed the latest version of Xenix (2.2.? -- I don't recall which exactly). Our applications are all written in RM/COBOL supported by Austec.

Our customer arrives in the morning around 6 am and starts using a terminal or two. By 9 am they are up to full strength running all 6 terminals. When a user runs our software, they all log in with the same user id, which starts executing our own user interface shell (written in COBOL). It emulates the user interface that we had on our original system running on TI minis. The user then selects the program to run, and our shell uses the COBOL CALL statement to call the program. I don't believe that it actually forks a process to do this, though I might be wrong. Typically, a user seldom logs all the way out and just goes in and out of programs from our shell.

Some time after 11:00 am, the display of the various screens all starts to slow down. Instead of blasting the fields to the screen, it chunks a line at a time. If someone can get to a terminal with the login prompt, they may be able to log in as shutdown and reboot the system. All user programs can usually save their data (the applications are all doing database-like operations, using the key-indexed files provided by COBOL -- we don't use any external database facility). However, they cannot exit back out of the programs. They just hang. If they are successful at rebooting the system, everyone comes back up and it works just fine. If not, then the whole system hangs.

The hang is interesting. If you have a terminal sitting at the regular sh prompt, you can type carriage returns and the prompt is echoed. If you do any command (ps, shutdown, etc.) then it just goes away and doesn't respond.
You can still type on the terminal, and characters are echoed, but nothing happens. You can also switch to a different screen and bring up the system prompt. However, if you try to type on this screen, nothing is echoed.

It seems to be related to the amount of work that gets done. If they then go for another three hours, it will crap out again. Our customers are currently rebooting at 11, 2 and 5 before going home and running their nightly backup and upkeep programs. It is extremely inconvenient. We have three other customers waiting for their systems, but we don't want to ship them until we can get this problem fixed. We have contacted SCO a couple of times, and Austec, but haven't had any resolution of the problem (there seems to be some finger pointing between the two).

Another data point -- I tried to simulate the problem and wrote a program to CALL a couple of programs and exit, etc. I eventually did get code that would always cause the system to hang in exactly the same way (but it doesn't need to do any calls). However, I tracked down the problem and it is some kind of record/file locking problem. The program that eventually causes the hang essentially opens, and writes to, a shared file. It hangs randomly, when one terminal opens or closes the file and another writes it. We guarantee that they don't write concurrently to the same records, but it still shouldn't get into a situation where it hangs the entire system.

The resulting hang acts just like the hang at our customer. However, this hang can happen in one minute or in three hours. It is entirely timing dependent, not load dependent. I believe that my program uncovers another bug, and really isn't what our user is seeing (I tried rewriting the program so the file isn't shared and installed it at the customer -- it didn't help, since we still have lots of shared files that are used in the system). Plus, the fact that it is time varying makes me believe that it is a different problem, though the result is the same.
(BTW -- the buggy program was run on a COMPAQ 386/20 DeskPRO running 2.3.1 Xenix.)

Any help would be greatly appreciated. Can I write some logging programs to write useful information to a file that we could examine after a crash? Is there some system parameter that I could tweak to alleviate it?

Thanks. Bob.
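[Editor's note: Bob's two-terminal reproduction can be sketched, loosely, with POSIX record locks. RM/COBOL's keyed-file locking lives inside its runtime, so this is only an illustration of the open/lock/write/close access pattern he describes, not the actual code; the function name and record layout are made up.]

```python
import fcntl
import os

def locked_record_write(path, offset, data):
    """Open a shared file, take an exclusive lock on one record's byte
    range, write the record, and release the lock.  Two users doing this
    against the same file -- one opening or closing while another writes --
    is the interleaving that reproduced the hang."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
    try:
        # Lock only this record's bytes, not the whole file.
        fcntl.lockf(fd, fcntl.LOCK_EX, len(data), offset, os.SEEK_SET)
        os.lseek(fd, offset, os.SEEK_SET)
        os.write(fd, data)
        fcntl.lockf(fd, fcntl.LOCK_UN, len(data), offset, os.SEEK_SET)
    finally:
        os.close(fd)
```

A correct kernel simply serializes the writers; the complaint above is that Xenix 2.2/2.3 could wedge the whole machine under this pattern.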
debra@alice.UUCP (Paul De Bra) (12/11/88)
In article <766@wasatch.UUCP> kessler%cons.utah.edu@wasatch.UUCP (Robert R. Kessler) writes:
>We are having problems with SCO 386 Xenix and are looking for some help.
>
>Here is the scenario:
>
> [scenario about a system slowly dying (or hanging) deleted]

This story sounds like a memory allocation problem. It looks like the program is slowly but surely allocating more and more memory. The system then needs more and more time to free up memory for additional processes.

My suggestion: let all 6 users start working and then perform a "ps -el" about every 5 minutes and watch the size of the program. It may not even take 3 hours to reach the problem, as the system will now have the additional problem of finding memory to run the ps as well.

Paul.
--
------------------------------------------------------
|debra@research.att.com   |   uunet!research!debra   |
------------------------------------------------------
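[Editor's note: Paul's suggestion amounts to a tiny monitoring loop. A sketch -- the log path, interval, and the use of Python rather than a shell script are the editor's choices, not Paul's:]

```python
import subprocess
import time

def log_snapshots(cmd, logfile, interval=300, cycles=None):
    """Append the output of `cmd` (e.g. ["ps", "-el"]) to `logfile`
    every `interval` seconds, each snapshot under a timestamp header.
    With cycles=None it runs until interrupted; because the log is
    flushed as it goes, the last snapshots survive a crash and show
    which process grew."""
    done = 0
    with open(logfile, "a") as out:
        while cycles is None or done < cycles:
            out.write(time.strftime("----- %Y-%m-%d %H:%M:%S -----\n"))
            out.flush()
            subprocess.run(cmd, stdout=out, stderr=subprocess.STDOUT)
            done += 1
            if cycles is None or done < cycles:
                time.sleep(interval)

# e.g.: log_snapshots(["ps", "-el"], "/tmp/ps.log")   # every 5 minutes
```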
jbayer@ispi.UUCP (Jonathan Bayer) (12/12/88)
In article <766@wasatch.UUCP>, kessler%cons.utah.edu@wasatch.UUCP (Robert R. Kessler) writes:
> We are having problems with SCO 386 Xenix and are looking for some help.
>
> Here is the scenario:
>
> Our customer is running on an IBM PS 2/80, 20 Mhz, with a Hostess
> multiport board. They run about 6 concurrent users. We have
> installed the latest version of Xenix (2.2.? -- I don't recall which
> exactly).
>
> Our applications are all written in RM/COBOL supported by Austec.
>
> Our customer arrives in the morning around 6 am and starts using a
> terminal or two. By 9 am they are up to full strength running all 6
> terminals. When a user runs our software, they all login with the
> same user id which starts executing our own user interface shell
       ^^^^^^^^^^^

This is a very bad idea. We had a customer who did the same thing. The system would hang if too many people logged on using the same id. Make sure they use different ids. Otherwise it seems that internal system tables are getting filled up.

Also, how much memory is in the system? We figure that the base system needs about 1 meg, and then we add 512k for each additional terminal. That seems to work. As I understand it, RM/COBOL is a memory hog.

Jonathan Bayer
Intelligent Software Products, Inc.
bill@bilver.UUCP (bill vermillion) (12/13/88)
In article <766@wasatch.UUCP> kessler%cons.utah.edu@wasatch.UUCP (Robert R. Kessler) writes:
>We are having problems with SCO 386 Xenix and are looking for some help.
>
>Here is the scenario:
>
...... (deleted much relating to system slowdown.)....
>
> . When a user runs our software, they all login with the
>same user id which starts executing our own user interface shell

I have seen this occur on older Xenix systems -- the Tandy 6000 comes to mind. The system would grind to a halt. The problem was that someone decided all users should have the same login id. When everyone used their own login, things became tolerable again. (As tolerable as a 6 MHz multiuser 68000 machine can be.)

>hang is interesting. If you have a terminal sitting at the regular sh
>prompt, you can type carriage returns and the prompt is echoed. If
>you do any command (ps, shutdown, etc) then it just goes away and
>doesn't respond. You can still type on the terminal, characters are
>echoed, but nothing happens. You can also switch to a different
>screen and bring up the system prompt. However, if you try to type on
>this screen, nothing is echoed.

Sounds like it might be being "swapped to death" and/or running out of processes. It can still be related to "one user" logging in multiple times.

>Any help would be greatly appreciated.

Just a thought. It might not be applicable in your case.
--
Bill Vermillion - UUCP: {uiucuxc,hoptoad,petsd}!peora!rtmvax!bilver!bill
                      : bill@bilver.UUCP
kessler%cons.utah.edu@wasatch.UUCP (Robert R. Kessler) (12/13/88)
Oops, this should have gone to the whole mailing list:

To: jbayer@ispi.UUCP
Subject: Re: SCO Xenix System Hang
Newsgroups: comp.unix.xenix
In-Reply-To: <347@ispi.UUCP>
References: <766@wasatch.UUCP>
Organization: University of Utah, Computer Science Dept.

In article <347@ispi.UUCP> you write:
>In article <766@wasatch.UUCP>, kessler%cons.utah.edu@wasatch.UUCP (Robert R. Kessler) writes:
>> We are having problems with SCO 386 Xenix and are looking for some help.
>> [Lines deleted]
>> terminals. When a user runs our software, they all login with the
>> same user id which starts executing our own user interface shell
>        ^^^^^^^^^^^
>This is a very bad idea. We had a customer who did the same thing. The
>system would hang if too many people would log on using the same id.

Did you see the same kind of hang that I outlined? Where it just died after a while, or did it hang when someone tried to login?

>Make sure they use different ids. Otherwise it seems that internal
>system tables are getting filled up.

As an experiment, that would be easy to try.

>Also, how much memory is in the system? We figure that the base system
>needs about 1 meg, and then we add 512k for each additional terminal. It
>seems to work. As I understand RM/Cobol is a memory hog.

That particular system has 4 Meg. The COMPAQ on which I was testing the COBOL program that hung with concurrent access has 5 Meg (and I could hang it with only 2 users).

>Jonathan Bayer
>Intelligent Software Products, Inc.

By the way -- as suggested by Paul De Bra, I installed a script which slept for 5 minutes and then dumped out a ps -el into a file. It was installed and then ran, waiting for the system to hang. It ran for 6 hours without a system hang -- a new record for length of time under full usage. They had to reboot the system to leave for the evening, so it never did hang. We are rerunning the experiment today. Boy, what a strange coincidence.

Bob.
daveh@marob.MASA.COM (Dave Hammond) (12/15/88)
>In article <329@bilver.UUCP> bill@bilver.UUCP (bill vermillion) writes:
>We are having problems with SCO 386 Xenix and are looking for some help.
>[...]
> . When a user runs our software, they all login with the
>same user id which starts executing our own user interface shell
      ^^^^^^^^^^^^ (problem lies here)
>hang is interesting. If you have a terminal sitting at the regular sh
>prompt, you can type carriage returns and the prompt is echoed. If
>you do any command (ps, shutdown, etc) then it just goes away and
>doesn't respond. You can still type on the terminal, characters are
>echoed, but nothing happens. You can also switch to a different
>screen and bring up the system prompt. However, if you try to type on
>this screen, nothing is echoed.

You have exceeded the system-imposed open files limit (_NFILES in stdio.h) for a single user-id. The only solution is to assign individual user accounts to each of your users. The shell prompt will continue to be issued, as no files are opened by simply pressing Return. However, as soon as a valid (non-builtin) command is issued, the shell must fork/exec the child program, which involves opening more files (the program binary itself, at the very least) and fails due to the aforementioned file table overflow.

BTW, unless there is a Real Good Reason for having multiple users with the same user-id, it is far more appropriate for security, file management and system administration purposes to assign individual user-ids.
--
Dave Hammond
...!uunet!masa.com!{marob,dsix2}!daveh
ruediger@ramz.UUCP (Ruediger Helsch) (12/16/88)
In article <329@bilver.UUCP> bill@bilver.UUCP (bill vermillion) writes:
>In article <766@wasatch.UUCP> kessler%cons.utah.edu@wasatch.UUCP (Robert R. Kessler) writes:
>> You can still type on the terminal, characters are
>>echoed, but nothing happens. You can also switch to a different
>>screen and bring up the system prompt. However, if you try to type on
>>this screen, nothing is echoed.
>
>Sounds like it might be being "swapped to death" and/or run out of processes.
>Still can be related to "one user" loggin in multiple times.

Our system (Xenix release 2.2.1 on an AT) showed the same behaviour when I once tried to run it out of memory (by repeated forking or repeated malloc()ing). malloc() or fork() should eventually return an error (after swapping everything out, which is what I wanted to see). But they don't. They just seem to try forever.

It seems to me that requests for memory can't fail in Xenix. Of course they also can't return if memory limits are exceeded. They just hang there, and the system does its best searching for memory, swapping round and round. Possibly the COBOL runtime system requests more and more memory, without ever freeing it. After some hours it exceeds the available memory and the swapping starts.
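[Editor's note: the leak hypothesis Ruediger and Paul share is easy to check from inside a process on a modern system by watching peak resident set size grow. A sketch -- note the assumption that ru_maxrss is reported in kilobytes, which is true on Linux but not on every BSD-derived system:]

```python
import resource

def peak_rss():
    """Peak resident set size of this process so far (kB on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def demonstrate_leak(megabytes=32):
    """Allocate memory and never free it -- the pattern suspected of
    the COBOL runtime.  Peak RSS only ever grows; a process doing this
    for hours eventually drives the system into constant swapping."""
    before = peak_rss()
    hoard = [bytearray(1024 * 1024) for _ in range(megabytes)]  # held, never freed
    after = peak_rss()
    return before, after, len(hoard)
```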
dyer@spdcc.COM (Steve Dyer) (12/16/88)
In article <418@marob.MASA.COM> daveh@marob.masa.com (Dave Hammond) writes:
>>[description of a hung system]
>
>You have exceeded the system-imposed open files limit (_NFILES in stdio.h)
>for a single user-id. The only solution is to assign individual user
>accounts to each of your users. The shell prompt will continue to be issued,
>as no files are opened by simply pressing Return. However, as soon as a
>valid (non-builtin) command is issued, the shell must fork/exec the child
>program, which involves opening more files (the program binary itself,
>at the very least) and fails due to the afore-mentioned file table overflow.

I'm afraid that you have things a little mixed up.

There is a global (per-system) limit on the maximum number of files which may be open at once. In most versions of UNIX, objects of type "struct file" are allocated from a global array, "file[NFILE]", where NFILE may be a compile-time manifest constant or a tunable parameter used at boot time for memory allocation. Attempting to run over this generates a kernel printf "file table overflow" on the console of most versions of UNIX, and causes the offending system call to return -1 with an errno of ENFILE.

There is a per-process limit on the maximum number of open files, usually referred to as NOFILES, and this ranges from 20 to 60 or more, depending on your variant of UNIX. Hopefully, _NFILE (from stdio.h) will agree with NOFILES, though I have come across flavors where the kernel folks weren't talking to the library folks. An open/dup/creat system call that runs over this limit will return -1 with errno set to EMFILE.

There is also a limit on the maximum number of simultaneously-running processes for a non-root user id, known as MAXUPRC, which is usually something like 25. fork() returns -1 with errno set to EAGAIN if MAXUPRC is reached.
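[Editor's note: the per-process limit Dyer describes (NOFILES then, RLIMIT_NOFILE today) is the easy one to demonstrate, along with the EMFILE-vs-ENFILE distinction he draws. A sketch:]

```python
import errno
import os

def open_until_limit(path="/dev/null"):
    """Open descriptors until the per-process limit is reached.
    Returns (how many opens succeeded, the errno of the failure):
    EMFILE for the per-process limit, as opposed to ENFILE for the
    system-wide "file table overflow" Dyer describes."""
    fds = []
    err = None
    try:
        while True:
            fds.append(os.open(path, os.O_RDONLY))
    except OSError as e:
        err = e.errno
    finally:
        for fd in fds:       # release everything we grabbed
            os.close(fd)
    return len(fds), err
```

Lowering the soft RLIMIT_NOFILE first (with resource.setrlimit) keeps the demonstration cheap.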
gordon@sneaky.TANDY.COM (Gordon Burditt) (12/19/88)
The described "hang" -- the system running very slowly, with different users using different logins fixing or reducing the problem -- is caused by the per-user-id process limit. There are several features that contribute to this:

1. There is a limit to the number of processes a non-root user may have running at one time, called MAXUPRC. The default is probably something like 15. If you can re-link the kernel, you can probably raise this limit. (I am working from an old version (non-*86) of SCO Xenix System III, so some of this may have changed.) Raising this limit does not increase the size of any tables.

If you need lots of processes (and this problem will exist regardless of what uids they run under), you may need to increase the number of entries for processes, open files, and inodes. If you run out of open files or inodes, you get cryptic console messages like "no file". If you run out of system processes, you may get error messages (csh), or just lots of retrying (sh).

Fix: don't run everything under the same uid, and/or raise MAXUPRC.

2. If the "sh" shell gets an error on a "fork" due to running over the MAXUPRC limit, it retries. Forever, unless interrupted by a signal. For a quick test of this, log in, then type sh<RETURN> repeatedly. After about 15 or 20 times, you won't get another prompt. Use your interrupt character to unlock the terminal. Then type lots of control-D's to get rid of all those shells.

Now, imagine three users logged in under the same user id. Each has 5 processes, and each is trying to create a 6th with sh. None of them will get any work done until one of them aborts the retries and terminates that shell. Whether or not the interrupt character can do this, and whether trying will destroy data, depends on the application. The scenario this retrying is supposed to handle is waiting for another, independent job started from the same terminal (say, a mailer doing background delivery) to release its processes without requiring more.
Fix: try to arrange jobs so they do not require deep nesting of processes.

3. To further complicate the situation, some applications don't wait for their children. A process occupies a process slot until its parent (or, if its parent dies, its foster parent, process 1) waits for it. If an application keeps spinning off background print jobs, and never waits for them to finish, eventually it will hit MAXUPRC or run the system out of processes. These will show up in a ps as zombie processes with parents other than process 1.

A compromise for this might be to allow one outstanding background job, and after spinning off the second one, wait for one of them to finish. Also, doing a wait() with an alarm() set can pick up already-terminated processes without waiting for all of them.

Fix: applications should wait for their children. (Also check the status and report problems!)

4. This one gets a little exotic, and may be specific to the system I am using. It doesn't apply to systems that do paging instead of swapping. It also isn't related to running everything under one user id. There is a limit in the kernel on the maximum amount of memory a process can use at one time. Suppose that the kernel is re-configured to raise this limit above the amount of memory available. (Limit > physical memory - kernel memory. "Available memory" means the maximum amount of memory a process can get without hanging the system, after administrative restrictions are raised.)

Now, have an application program request 110% of available memory. This request will fail. Have an application request 100% of available memory plus one allocation unit. This request doesn't fail (but it should). The process gets swapped out and tries to swap back in. In the process, the swapper swaps everything else out. You can't kill the huge process because it needs to swap in to die. Something else running may lock up behind this process, or it may run, but slowly, because it keeps getting swapped out.
The fix for this is to not let processes get away with requesting that much memory. The easiest way is to lower the "administrative limit" maxprocmem. This may not be present in System V, or it may exist in another form.

Gordon L. Burditt
...!texbell!sneaky!gordon
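[Editor's note: Gordon's points 2 and 3 amount to a defensive pattern -- bound the fork() retry instead of looping forever as sh does, and always reap children so they don't sit as zombies occupying process slots. A sketch; the retry count and delay are arbitrary choices:]

```python
import errno
import os
import time

def spawn(argv, retries=5, delay=0.5):
    """fork/exec with a *bounded* retry on EAGAIN (the per-user
    process limit, MAXUPRC in old kernels), unlike sh's
    retry-forever loop."""
    for attempt in range(retries):
        try:
            pid = os.fork()
        except OSError as e:
            if e.errno == errno.EAGAIN and attempt < retries - 1:
                time.sleep(delay)   # another job may release a slot
                continue
            raise                   # give up with a real error
        if pid == 0:                # child: replace ourselves with argv
            try:
                os.execvp(argv[0], argv)
            finally:
                os._exit(127)       # only reached if exec failed
        return pid                  # parent

def reap(pid):
    """wait() for a child so it does not linger as a zombie."""
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```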
domo@riddle.UUCP (Dominic Dunlop) (12/19/88)
In article <418@marob.MASA.COM> daveh@marob.masa.com (Dave Hammond) writes:
>>In article <329@bilver.UUCP> bill@bilver.UUCP (bill vermillion) writes:
>>[...]
>> . When a user runs our software, they all login with the
>>same user id which starts executing our own user interface shell
>       ^^^^^^^^^^^^
>       (problem lies here)
>...
>BTW, unless there is a Real Good Reason for having multiple users with the
>same user-id, it is far more appropriate for security, file management and
>system administration purposes to assign individual user-ids.

Absolutely. Bill can probably achieve the effect he desires in terms of file access permissions by putting all of the users of his application into a single, dedicated group -- let's call it billsapp. This needs a line like

    300::billsapp:appuser1,appuser2,appuser3,appuser4

in /etc/group, together with 300 (or whatever group id he assigns) in the fourth field of the /etc/passwd entry for each of those users. For example:

    appuser3:ecBxt/7Df8qlf:123:300:Bill's third user:/u/app/:/bin/sh

The SCO Xenix mkuser (user administration) tools could help Bill to set things up in this way, except that, as I recall (I can't run it up just now -- the closest system's sufficiently secure that I don't know the root password), it does not like assigning a new user to an existing home directory, an action which is perfectly proper in a case such as this. So Bill will probably have to do things by hand (SCO, please note).

This done, all files used by the application should be fixed to have write permission for group. Directories used for such things as scratch and spool files need group write permission too. Often it's sufficient to do this set-up by hand as part of installation, as many applications don't need to _create_ shareable files when they run; they simply need to update existing shareable files, and to create user-private scratch files.
Where shareable files must be created -- for example, if one user generates a print spooling file which can subsequently be reprinted by another user -- the application must create a shareable file. The best way to do this is to modify the application so that it explicitly gives a suitable file permissions mask value to the corresponding creat() or open() call. To be absolutely certain of getting a particular set of permissions, whatever the value of the user's file creation permissions mask (umask), it's necessary to call chmod() after file creation.

However, it's often sufficient to set up a suitable umask in the .profile script run by /bin/sh when a user logs into the system, provided that calls on open() and creat() in the application specify permissive creation permissions -- typically 0666. (Most programs, including BASIC, COBOL, and ``fourth-generation'' run-time systems, do this.) For example,

    umask 007   # leave read/write for user and group; nothing for others
    PATH=/u/billsbin:$PATH
    export PATH
    cd /u/billsapp/billsdata
    exec billsapp

in /u/billsapp/.profile will throw any user whose home directory is /u/billsapp (and whose login shell is /bin/sh) straight into the application after setting up a suitable file creation mask, executable search path, and current directory. (Remember that the umask names the permission bits to take away: a umask of 007 turns a requested 0666 into 0660, read/write for user and group only.)

My company uses a procedure very like this to run up an accounting package which we bought in binary form (and so couldn't modify), and which was very insecure straight out of the box. (All the permissions were wide open, so anybody could do anything, whether they were an accounts person or not.)

The method I've outlined is probably sufficient for most applications. However, there are times when more sophistication is required -- for example, the case where an application needs to write a file which can be accessed only by the application, not even by UNIX shell commands run by the user who created the file. In such cases, you need to get involved with set-group-id programs.
I'd counsel against doing this unless you really need to, as a) they're surprisingly difficult to get right (AT&T's mail gets it right; AT&T's lp gets it wrong); and b) it's a good way to write applications which are not easily portable between AT&T and Berkeley variants of UNIX (Xenix is in the AT&T camp on this issue).

To read further, _UNIX System Security_ by Wood & Kochan (Hayden, 1985, ISBN 0-8104-6267-2) is a good place to start.

Hope this helps.
--
Dominic Dunlop
domo@sphinx.co.uk   domo@riddle.uucp
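[Editor's note: Dominic's two mechanisms -- a permissive creation mode trimmed by the umask, versus an explicit chmod() when you must be certain -- can be sketched side by side. The file names and function names here are illustrative only:]

```python
import os
import stat

def create_via_umask(path):
    """Ask for 0666 and let the caller's umask trim it -- the
    .profile approach: with umask 007 the file comes out 0660."""
    os.close(os.open(path, os.O_WRONLY | os.O_CREAT, 0o666))
    return stat.S_IMODE(os.stat(path).st_mode)

def create_forced(path):
    """Create the file, then chmod() it: the only way to be certain
    of the permissions regardless of the user's umask."""
    os.close(os.open(path, os.O_WRONLY | os.O_CREAT, 0o666))
    os.chmod(path, 0o660)   # rw for owner and group, nothing for others
    return stat.S_IMODE(os.stat(path).st_mode)
```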
chip@vector.UUCP (Chip Rosenthal) (12/20/88)
How about checking the simple stuff first? Is your 386 an old 15 MHz CPU? If so, you might be hitting the infamous "Errata 20" bug, in which case SCO has a fix disk available.
--
Chip Rosenthal  chip@vector.UUCP   | Choke me in the shallow water
Dallas Semiconductor  214-450-5337 | before I get too deep.
steven@lakesys.UUCP (Steven Goodman) (12/23/88)
In article <671@vector.UUCP> chip@vector.UUCP (Chip Rosenthal) writes:
>How about checking the simple stuff first? Is your 386 an old 15MHz CPU?
>If so, you might be hitting the infamous "Errata 20" bug, and in which case
>SCO has a fix disk available.

If this bug appears and you happen to have a Compaq, contact your dealer. Compaq will (free of charge) add a small board which plugs into your 386 socket and fixes the "errata 20" bug.

Not wanting to repeat what might have already been said on this subject, but this bug only occurs when using a 387 chip, and you can boot up your system to ignore the 387 until you receive some kind of fix. From what I understand, this is NOT fixable via software.
--
Steven M. Goodman
Lake Systems - Milwaukee, Wisconsin
uunet!marque!lakesys!steven   uwvax!uwmcsd1!lakesys!steven