doelz@urz.unibas.ch (Reinhard Doelz) (03/01/90)
GREAT. I reported this problem to SGI in fall 1989, and they (ISO)
told me that others had encountered the same problem. I hacked a workaround
that fixes the problem for now, because even 3.2.1 didn't fix it.
The following is a repost of a message I sent to INFO-IRIS in December.
Hope this helps...
Reinhard
===========================================
We are running a 120GTX under OS 3.2.1.
While the program shown below is running on two processors, the graphics
manager fails to start up and the graphics are unusable (no window manager).
*DOCUMENTATION:*
The /usr/adm/SYSLOG says (truncated, only significant lines shown)
Dec 7 09:17:00 modl grcond[290]: CIO: IRIX System V Release 3.2
IP5 Version 10171414
Dec 7 09:17:00 modl grcond[290]: CIO: CPU 1 taking over time and accounting
functions
Dec 7 09:17:00 modl grcond[290]: CIO: gfx_wait_cx: context switch timed out
Dec 7 09:17:00 modl grcond[290]: CIO:
Dec 7 09:17:00 modl grcond[290]: CIO: gm-2 (configured for IP5) 1.14+
Dec 7 09:17:00 modl grcond[290]: CIO:
Dec 7 09:17:00 modl grcond[290]: CIO: DEBUG_NOISE at 0x9806648C
Dec 7 09:17:00 modl grcond[290]: CIO: Loading PP ucode Version:
@(#) PEAPOD 1.2 pp microcode assembler
- 6/20/87
Dec 7 09:17:00 modl grcond[290]: CIO: Sat Aug 19 19:10:21 1989
user unknown revision(1.123CLOVER2IP5GT)
Dec 7 09:17:00 modl grcond[290]: CIO:
tried and failed ... as reported.
... and therefore I conclude that the grcond is unable to start up.
*WORKAROUND:*
The IRIS is fully networked, running NFS, 4DDN and TCP/IP, and is thus
possibly suffering from this. Therefore, I changed the kernel in
/usr/sysgen/master.d to handle the network on CPU 0 as follows:
107c107
< #define NBUF 100 /* # buffers in disk buffer cache */
---
> #define NBUF 400 /* # buffers in disk buffer cache */
215c215
< #define MAXSC 26
---
> #define MAXSC 30
353c353
< int network_processor = 1;
---
> int network_processor = 0;
modl [/usr/sysgen/master.d] %
... did an lboot and problem solved.
*PROBLEM REPRODUCTION:*
The Fortran program causing the problem is a special application. However,
a C program will do it as well. The following is a dummy routine which
reproduces the failure:
      real x(100,1000), y(100,1000)
      seed=123456
      do 100 i=1,100
        do 101 ii=1,1000
          x(ii,i)=rand(seed)
 101    continue
 100  continue
      write (6,*)'ran done'
      do 200 i=1,100
        do 201 ii=1,1000
          y(ii,i)=sin(x(ii,i))*cos(x(ii,i))
          y(ii,i)=
     *      (y(ii,i)**1.003)**(1-(sin(x(ii,i))/1000))
          do 202 iii=1,900
            y(ii,i)=
     *        y(ii,i)**(1-(sin(x(ii,i))/1000))
 202      continue
 201    continue
 200  continue
      stop
      end
pfa parallelizes the do 200 loop, which gives a program that runs fully
in parallel.
*ALTERNATIVE WORKAROUND:*
To avoid the kernel modification you could also log in from another
terminal (not the console), or even log in as root NOGRAPHICS, and
call the debugger with dbx -p # (# being the process ID of the parallel
job), which is equivalent to sending a schedctl call to this process.
One could do it more elegantly with a small C routine, but I didn't
bother with that.
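
A minimal sketch of what such a small C routine might look like. The post
does not say which schedctl command is meant, so this assumes the goal is
simply to lower the scheduling priority of the compute-bound job so that
grcond gets a chance to run. It uses the POSIX setpriority() call as a
portable stand-in, not the IRIX-specific schedctl() interface. The helper
name lower_priority is hypothetical and not from the original post.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>

/*
 * Hypothetical helper (an assumption, not the author's routine):
 * set the nice value of process `pid` to `niceness`.
 * A pid of 0 means the calling process.
 * Returns 0 on success, -1 on failure (errno is set by setpriority).
 */
int lower_priority(pid_t pid, int niceness)
{
    /* Only allow lowering: a negative nice value raises priority
     * and normally requires root. */
    assert(niceness >= 0);

    /* PRIO_PROCESS selects a single process by its process ID. */
    return setpriority(PRIO_PROCESS, (id_t)pid, niceness);
}
```

From a remote login one would call lower_priority(pid, 19) with the process
ID of the parallel job. Whether this has the same effect as attaching with
dbx -p on the 120GTX is untested.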
************************************************************************
* Dr. Reinhard Doelz * SWITZERLAND *
* Biocomputing * *
* Biozentrum * doelz%urz.unibas.ch@relay.cs.net *
* Klingelbergstrasse 70 * *
* CH-4056 Basel * *
************************************************************************

dixons%phvax.dnet@SMITHKLINE.COM (03/01/90)
I had the same problem logging in on the console of a 4D240 when it had
compute-bound jobs running on all four processors. In contrast to the other
recent postings, when I called the hotline about this, I was told that it
was a known problem, that it would be fixed in 3.2.2 (which is "ship on
fail"), and I was given a workaround similar to the one posted by Reinhard
Doelz. That fixed the problem. However, when I first called the hotline, it
was handled as a hardware problem, and only when I noticed that the problem
went away when the machine was not loaded did they pick up on the software
problem. It seems to depend on whether the call is passed to a hardware or
a software person. Nevertheless, the hotline had the fix and I got it from
them without much trouble.
Scott Dixon (dixons@smithkline.com)