[comp.sys.sgi] login problem.

doelz@urz.unibas.ch (Reinhard Doelz) (03/01/90)

GREAT. I reported this problem to SGI in fall 1989, and they (ISO)
told me that others had encountered the same problem. I hacked a workaround
which temporarily fixes the problem, because even 3.2.1 didn't fix it.
The following is a repost of a message I sent to INFO-IRIS in December.
Hope this helps...

Reinhard 

===========================================

We are running a 120GTX with OS 3.2.1.
When the program shown below runs on two processors, the graphics
manager fails to start up and the graphics is unusable (no window
manager).

*DOCUMENTATION:*
The /usr/adm/SYSLOG says  (truncated, only significant lines shown)

Dec  7 09:17:00 modl grcond[290]: CIO: IRIX System V Release 3.2
                                 IP5 Version 10171414
Dec  7 09:17:00 modl grcond[290]: CIO: CPU 1 taking over time and accounting 
                                  functions
Dec  7 09:17:00 modl grcond[290]: CIO: gfx_wait_cx:  context switch timed out
Dec  7 09:17:00 modl grcond[290]: CIO:
Dec  7 09:17:00 modl grcond[290]: CIO: gm-2 (configured for IP5) 1.14+
Dec  7 09:17:00 modl grcond[290]: CIO:
Dec  7 09:17:00 modl grcond[290]: CIO: DEBUG_NOISE at 0x9806648C
Dec  7 09:17:00 modl grcond[290]: CIO: Loading PP ucode Version:  
                                  @(#) PEAPOD 1.2 pp microcode assembler 
                                  - 6/20/87
Dec  7 09:17:00 modl grcond[290]: CIO: Sat Aug 19 19:10:21 1989 
                                  user unknown revision(1.123CLOVER2IP5GT)
Dec  7 09:17:00 modl grcond[290]: CIO: 

tried and failed ... as reported.

... and therefore I conclude that the grcond is unable to start up. 
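
To check a log for the same symptom, one could scan it for the
gfx_wait_cx timeout line shown above. The following is only a minimal
sketch in C: the function name count_gfx_timeouts is hypothetical, and
it assumes the message text appears exactly as in the excerpt.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Count the lines in an already opened log that report the
 * "gfx_wait_cx: context switch timed out" condition.
 * The two substrings are matched separately so that the amount of
 * spacing between them does not matter. Lines longer than the buffer
 * are read in pieces, which is good enough for a quick check. */
int count_gfx_timeouts(FILE *log)
{
    char line[1024];
    int hits = 0;

    assert(log != NULL);
    while (fgets(line, sizeof line, log) != NULL) {
        if (strstr(line, "gfx_wait_cx") != NULL &&
            strstr(line, "context switch timed out") != NULL)
            hits++;
    }
    return hits;
}
```

A caller would fopen() the log (here /usr/adm/SYSLOG), pass the stream
to the function, and treat a non-zero result as a sign of this problem.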

*WORKAROUND:*
The IRIS is fully networked, running NFS, 4DDN and TCP/IP, and may
therefore be suffering from this. So I changed the kernel configuration
in /usr/sysgen/master.d so that the network is handled on CPU 0, as follows:

107c107
< #define NBUF 100              /* # buffers in disk buffer cache */
---
> #define NBUF 400              /* # buffers in disk buffer cache */
215c215
< #define MAXSC 26
---
> #define MAXSC 30
353c353
< int           network_processor = 1;
---
> int           network_processor = 0;
modl [/usr/sysgen/master.d] %

... did an lboot and problem solved. 

*PROBLEM REPRODUCTION:*
The Fortran program causing the problem is a special application. However, 
a C program will do it as well. The following is a dummy routine which 
reproduces the problem: 

        real x(1000,100), y(1000,100)
        seed=123456
        do 100 i=1,100
                do 101 ii=1,1000
                        x(ii,i)=rand(seed)
101                     continue
100             continue
        write (6,*)'ran done'
        do 200 i=1,100
                do 201 ii=1,1000
                        y(ii,i)=sin(x(ii,i))*cos(x(ii,i))
                        y(ii,i)=
     *                  (y(ii,i)**1.003)**(1-(sin(x(ii,i))/1000))
                        do 202 iii=1,900
                                y(ii,i)=
     *                          y(ii,i)**(1-(sin(x(ii,i))/1000))
202                             continue
201                     continue
200             continue
        stop
        end

pfa concurrentizes the 200 do loop, which gives a program that runs 
fully in parallel. 

*ALTERNATIVE WORKAROUND:*
In order to avoid the kernel modification you could also log in from 
another terminal (not the console), or even log in as root NOGRAPHICS, 
and call the debugger with dbx -p # (# being the process ID of the 
parallel job), which is equivalent to sending a schedctl call to this 
process. One could do it more elegantly with a small C routine, but I 
didn't bother with that. 
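
For what it's worth, a small C routine along those lines might look
like the sketch below. It does NOT use schedctl: it substitutes the
portable setpriority() call, which lowers the priority of a running
process, and the helper name lower_priority is made up. Whether this
has the same effect on the console login as the dbx trick is an
assumption, not something tested on the IRIS.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/resource.h>

/* Lower the scheduling priority of process `pid` by giving it the
 * nice value `nice_value` (larger means less CPU preference).
 * A pid of 0 means the calling process, as setpriority() defines it.
 * Returns 0 on success or the errno value on failure. */
int lower_priority(pid_t pid, int nice_value)
{
    assert(pid >= 0);
    if (setpriority(PRIO_PROCESS, (id_t)pid, nice_value) != 0)
        return errno;
    return 0;
}
```

One would call it with the process ID of the compute-bound job, for
example lower_priority(pid, 19). Changing another user's process
normally requires root.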

  
  ************************************************************************
  *   Dr. Reinhard Doelz           *           SWITZERLAND               *
  *     Biocomputing               *                                     *
  *      Biozentrum                * doelz%urz.unibas.ch@relay.cs.net    *
  * Klingelbergstrasse 70          *                                     *
  *     CH-4056 Basel              *                                     *
  ************************************************************************

dixons%phvax.dnet@SMITHKLINE.COM (03/01/90)

I had the same problem logging in on the console on a 4d240 when
it had compute-bound jobs running on all four processors.  In contrast to
the other recent postings, when I called the hotline about this, I was told
that it was a known problem, that it would be fixed in 3.2.2 (which is
"ship on fail") and I was given a workaround similar to that posted by
Reinhard Doelz.  That fixed the problem. However, when I first called
the hotline, it was handled as a hardware problem and only when I noticed that
the problem went away when the machine was not loaded did they pick up
on the software problem.  It seems to depend on whether or not the call
is passed to a hardware or software person.  Nevertheless, the hotline
had the fix and I got it from them without much problem.
Scott Dixon (dixons@smithkline.com)