doelz@urz.unibas.ch (Reinhard Doelz) (03/01/90)
GREAT. I reported this problem to SGI in fall 1989, and they (ISO) told me that others encounter the same problem. I hacked a workaround which temporarily fixes the problem, because even 3.2.1 didn't fix it. The following is a repost of a message I sent to INFO-IRIS in December. Hope this helps...

Reinhard

===========================================

We are running a 120GTX OS 3.2.1. The program shown below runs on two processors. The graphics manager fails to start up and the graphics is unusable (no window manager).

*DOCUMENTATION:*

The /usr/adm/SYSLOG says (truncated, only significant lines shown)

Dec 7 09:17:00 modl grcond[290]: CIO: IRIX System V Release 3.2 IP5 Version 10171414
Dec 7 09:17:00 modl grcond[290]: CIO: CPU 1 taking over time and accounting functions
Dec 7 09:17:00 modl grcond[290]: CIO: gfx_wait_cx: context switch timed out
Dec 7 09:17:00 modl grcond[290]: CIO:
Dec 7 09:17:00 modl grcond[290]: CIO: gm-2 (configured for IP5) 1.14+
Dec 7 09:17:00 modl grcond[290]: CIO:
Dec 7 09:17:00 modl grcond[290]: CIO: DEBUG_NOISE at 0x9806648C
Dec 7 09:17:00 modl grcond[290]: CIO: Loading PP ucode Version: @(#) PEAPOD 1.2 pp microcode assembler - 6/20/87
Dec 7 09:17:00 modl grcond[290]: CIO: Sat Aug 19 19:10:21 1989 user unknown revision(1.123CLOVER2IP5GT)
Dec 7 09:17:00 modl grcond[290]: CIO: tried and failed

... as reported.

... and therefore I conclude that the grcond is unable to start up.

*WORKAROUND:*

The IRIS is fully networked, running NFS, 4DDN and TCP/IP, and may therefore be suffering from this. So I changed the kernel configuration in /usr/sysgen/master.d so that the network is handled on CPU 0, as follows:

107c107
< #define NBUF 100 /* # buffers in disk buffer cache */
---
> #define NBUF 400 /* # buffers in disk buffer cache */
215c215
< #define MAXSC 26
---
> #define MAXSC 30
353c353
< int network_processor = 1;
---
> int network_processor = 0;
modl [/usr/sysgen/master.d] %

... did an lboot and problem solved.
*PROBLEM REPRODUCTION:*

The Fortran program causing the problem is a special application. However, a C program will do it as well. The following is a dummy routine which causes the crashes:

c     array bounds match the (ii,i) indexing used below
      real x(1000,100), y(1000,100)
      seed=123456
      do 100 i=1,100
      do 101 ii=1,1000
      x(ii,i)=rand(seed)
101   continue
100   continue
      write (6,*)'ran done'
      do 200 i=1,100
      do 201 ii=1,1000
      y(ii,i)=sin(x(ii,i))*cos(x(ii,i))
      y(ii,i)=
     *  (y(ii,i)**1.003)**(1-(sin(x(ii,i))/1000))
      do 202 iii=1,900
      y(ii,i)=
     *  y(ii,i)**(1-(sin(x(ii,i))/1000))
202   continue
201   continue
200   continue
      stop
      end

pfa concurrentizes the 200 do loop, which gives a program that runs fully in parallel.

*ALTERNATIVE WORKAROUND:*

In order to avoid the kernel modification you could also log in from another terminal (not the console), or even log in as root NOGRAPHICS, and call the debugger with

      dbx -p #      (# being the parallel job)

which is equivalent to sending a schedctl call to this process. One could do it more elegantly with a small C routine, but I didn't bother with that.

************************************************************************
* Dr. Reinhard Doelz       * SWITZERLAND                               *
* Biocomputing             *                                           *
* Biozentrum               * doelz%urz.unibas.ch@relay.cs.net          *
* Klingelbergstrasse 70    *                                           *
* CH-4056 Basel            *                                           *
************************************************************************
dixons%phvax.dnet@SMITHKLINE.COM (03/01/90)
I had the same problem getting logged in on the console on a 4d240 when I had compute-bound jobs running on all four processors. In contrast to the other recent postings, when I called the hotline about this, I was told that it was a known problem, that it would be fixed in 3.2.2 (which is "ship on fail"), and I was given a workaround similar to that posted by Reinhard Doelz. That fixed the problem.

However, when I first called the hotline, it was handled as a hardware problem, and only when I noticed that the problem went away when the machine was not loaded did they pick up on the software problem. It seems to depend on whether the call is passed to a hardware or a software person. Nevertheless, the hotline had the fix and I got it from them without much trouble.

Scott Dixon (dixons@smithkline.com)