[comp.sys.apollo] DN3500 refuse to get into DM

sun@me.utoronto.ca (Andy Sun Anu-guest) (01/13/90)

A strange thing has happened to an Apollo DN3500, something I've never
seen before, and I can't seem to find a cure for it. We've phoned up Apollo
tech support and they don't have a clue either. I was wondering whether some
Apollo experts on the net might shed some light on it. The problem is
this:

The Apollo that has the problem is connected to an Apollo ring consisting
of two other Apollo DN3000s. It is running SR10.1. The other two machines
are running SR9.7 (yucky, isn't it?). I don't know whether someone has done
something to it or it decided to go on strike by itself, but it just won't
load DM. I tried a soft boot (shutdown and reboot) and a hardware reset
(pressing the little white button at the back of it). Everything went
well (disk check passed, salvage boot volume okay, loading Init okay),
but that's it. It threw me back to a "Phase II Shell" with a ")" prompt
instead of loading the display manager and asking for a login. I manually
typed dm to try loading DM from that shell, but all I got was something
like (pm_$init) 3040001. I can still access its disk from the other SR9.7
machines (so the network is probably okay), but I cannot get it to do
anything else. Has anyone seen this kind of thing happen before? Any
help/suggestions are welcome.

Andy

-- 

_______________________________________________________________________________
Andy Sun                        | Internet: sun@me.utoronto.ca
University of Toronto, Canada   | UUCP    : csri.toronto.edu!me.utoronto.ca!sun
Dept. of Mechanical Engineering | BITNET  : sun@me.utoronto.BITNET

kts@quintro.uucp (Kenneth T. Smelcer) (01/16/90)

In article <1990Jan12.172440.851@me.toronto.edu> sun@me.utoronto.ca writes:
>The Apollo that has the problem is connected to an Apollo ring consisting
>of two other Apollo DN3000s. It is running SR10.1. The other two machines
>are running SR9.7 (yucky, isn't it?). I don't know whether someone has done
>something to it or it decided to go on strike by itself, but it just won't
>load DM. I tried a soft boot (shutdown and reboot) and a hardware reset
>(pressing the little white button at the back of it). Everything went
>well (disk check passed, salvage boot volume okay, loading Init okay),
>but that's it. It threw me back to a "Phase II Shell" with a ")" prompt
>instead of loading the display manager and asking for a login. I manually
>typed dm to try loading DM from that shell, but all I got was something
>like (pm_$init) 3040001. I can still access its disk from the other SR9.7
>machines (so the network is probably okay), but I cannot get it to do
>anything else. Has anyone seen this kind of thing happen before? Any
>help/suggestions are welcome.

We've had a similar problem with one of our DN3000s running 10.1.  
The problem is that the /etc/ttys file (actually `node_data/etc/ttys)
occasionally disappears.  Without this file, Init doesn't know to start
DM or any other login process, so it just exits.

The solution is to re-create the file from another node, SHUTdown (not Reset)
the problem node and reboot.
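A present-day sketch of the check Ken describes (not something anyone in the thread ran): it assumes the problem node's file system can be read as an ordinary directory tree from another machine, and that ttys follows the BSD convention of '#' comment lines. The function name `ttys_problem` and its return labels are hypothetical.

```python
import os
import sys
from typing import Optional


def ttys_problem(ttys_path: str) -> Optional[str]:
    """Return a short label for what is wrong with a ttys file, or None.

    Assumption: lines starting with '#' are comments (BSD ttys convention).
    """
    if not os.path.exists(ttys_path):
        return "missing"  # the failure Ken saw: Init has nothing to start
    if os.path.getsize(ttys_path) == 0:
        return "empty"
    with open(ttys_path, "r", errors="replace") as f:
        entries = [ln for ln in f if ln.strip() and not ln.lstrip().startswith("#")]
    if not entries:
        return "no entries"
    return None


if __name__ == "__main__":
    # Hypothetical usage: pass the path(s) to the node's `node_data/etc/ttys.
    for path in sys.argv[1:]:
        print(path, "->", ttys_problem(path) or "looks usable")
```

A file that exists but holds only comments is flagged separately, on the guess that it would leave Init just as idle as a missing one.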

Hope this helps!
-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Ken Smelcer        Glenayre Corp.           quintro!kts@lll-winken 
                   Quincy,  IL              tiamat!quintro!kts@uunet

lampi@pnet02.gryphon.com (Michael Lampi) (01/16/90)

Sounds to me like you might have the "service mode" switch set to "service".
It's a small toggle switch adjacent to the reset button. If it's in the down
position, then it is in service mode.

BTW, you can start the DM at the Phase II Shell by entering the "GO" command.

Michael Lampi               MDL Corporation   213/782-7888   fax 213/782-7927

UUCP: {ames!elroy, <routing site>}!gryphon!pnet02!lampi
INET: lampi@pnet02.gryphon.com
"My opinions are that of my corporation!"

dbfunk@ICAEN.UIOWA.EDU (David B Funk) (01/16/90)

In posting <1990Jan12.172440.851@me.toronto.edu> Andy Sun <sun@me.utoronto.ca> writes:

> A strange thing has happened to an Apollo DN3500 ... The problem is
> this:
> 
> The Apollo that has the problem is connected to an Apollo ring consisting
> of two other Apollo DN3000s. It is running SR10.1. The other two machines
> are running SR9.7 (yucky, isn't it?). I don't know whether someone has done
> something to it or it decided to go on strike by itself, but it just won't
> load DM. I tried a soft boot (shutdown and reboot) and a hardware reset
> (pressing the little white button at the back of it). Everything went
> well (disk check passed, salvage boot volume okay, loading Init okay),
> but that's it. It threw me back to a "Phase II Shell" with a ")" prompt
> instead of loading the display manager and asking for a login. I manually
> typed dm to try loading DM from that shell, but all I got was something
> like (pm_$init) 3040001. I can still access its disk from the other SR9.7
> machines (so the network is probably okay), but I cannot get it to do
> anything else. Has anyone seen this kind of thing happen before? Any
> help/suggestions are welcome.

I saw something like this once, on 2 machines at another site that were
sick: one wouldn't get out of the Phase II shell, the other wouldn't load
the DM. In both cases it was a user (dumb sys_admin) who caused the problem.
This person had logged in as "root" and had opened a DM edit pad on "/etc/init"
on the first machine and on "/etc/dm_or_spm" on the second. This seriously
wounded these 2 programs so that they couldn't do their job. I would guess
that if the DM itself (/sys/dm/dm) were shot you'd get that kind of problem.
There are possibly other system files that could cause such problems if they
were messed up (such as /sys/env, /etc/sys.conf, or something in /lib).
Try poking around in /sys, /sys/node_data, /etc, & /lib and look for a
system file that has been modified recently that shouldn't have been.
Look in `node_data/system_logs on that machine to see if there's some
kind of error log file that you can read. If you had another SR10 machine
handy it would help a lot. You could just start doing 'cmt's on various
system directories. Try starting the "SPM" to see if it is just the DM
that is hurt or something more basic. At the Phase II shell prompt,
type in "spm" to force it to start the server process manager. If the
SPM starts OK then it's something directly connected with the DM.
If the SPM croaks too then it's more basic, like a messed up library
or /sys/env.
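Dave's "look for a system file that has been modified recently" step can be sketched as a directory walk that reports files newer than a cutoff. This assumes the node's /sys, /etc and /lib trees are reachable as ordinary directories and that Unix-style modification times are meaningful there; `recently_modified` is a hypothetical name.

```python
import os
import sys
import time
from typing import Iterable, List, Optional, Tuple


def recently_modified(roots: Iterable[str], max_age_days: float,
                      now: Optional[float] = None) -> List[Tuple[str, float]]:
    """List (path, mtime) for files under `roots` modified within `max_age_days`.

    Results are newest first. Symbolic links are not followed.
    """
    if now is None:
        now = time.time()
    cutoff = now - max_age_days * 86400
    hits = []
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    mtime = os.lstat(path).st_mtime
                except OSError:
                    continue  # vanished or unreadable; skip it
                if mtime >= cutoff:
                    hits.append((path, mtime))
    hits.sort(key=lambda item: item[1], reverse=True)
    return hits


if __name__ == "__main__" and len(sys.argv) > 2:
    # Hypothetical usage: python recent.py 7 /mnt/node/sys /mnt/node/etc /mnt/node/lib
    days = float(sys.argv[1])
    for path, mtime in recently_modified(sys.argv[2:], days):
        print(time.strftime("%Y-%m-%d %H:%M", time.localtime(mtime)), path)
```

A timestamp only says a file was written, not that it was damaged; opening a file in an edit pad, as in Dave's story, may or may not touch it, so an empty result does not clear the system files.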

Dave Funk

dente@els.uucp (Colin Dente) (01/16/90)

In article <1990Jan12.172440.851@me.toronto.edu> sun@me.utoronto.ca writes:
>A strange thing has happened to an Apollo DN3500, something I've never
>seen before, and I can't seem to find a cure for it. We've phoned up Apollo
Help might be at hand...

>... it just won't
>load DM. I tried a soft boot (shutdown and reboot) and a hardware reset
>(pressing the little white button at the back of it). Everything went
>well (disk check passed, salvage boot volume okay, loading Init okay),
>but that's it. It threw me back to a "Phase II Shell" with a ")" prompt
>instead of loading the display manager and asking for a login. I manually
>typed dm to try loading DM from that shell, but all I got was something
>like (pm_$init) 3040001. I can still access its disk from the other SR9.7
>machines (so the network is probably okay), but I cannot get it to do
>anything else. Has anyone seen this kind of thing happen before?

I do remember having a problem like this - though I don't recall getting
a returned status of 3040001 (cleanup handler released out of order) -
sounds *weird*.  The problem that I had was that /dev/null had somehow
been corrupted - causing the DM to gracelessly exit when you tried to
start it.  I tried starting the DM with debug set to 1 - no help.  In the
end, out of desperation, I invoked spm with debug (dunno if the debug was
necessary) - and lo and behold - the last thing it did before dying was
print something like 'cant open /dev/null'.  One new /dev/null later, and
everything was hunky-dory - far more pleasant than the OS reload that
Apollo suggested.
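Colin only found the bad /dev/null through spm's debug output; the symptom itself can be checked directly. This sketch assumes Unix semantics (/dev/null is a character special file that accepts writes and always reads as end-of-file); how SR10 Domain/OS actually implements its null device is not covered in the thread, so treat that as an assumption. `null_device_problem` is a hypothetical name.

```python
import os
import stat
from typing import Optional


def null_device_problem(path: str = "/dev/null") -> Optional[str]:
    """Return a label describing what is wrong with a null device, or None.

    Assumes Unix semantics: a character special file that accepts writes
    and always reads as end-of-file.
    """
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return "missing"
    if not stat.S_ISCHR(st.st_mode):
        return "not a character device"  # e.g. replaced by an ordinary file
    try:
        with open(path, "r+b", buffering=0) as dev:
            dev.write(b"probe")
            if dev.read(1) != b"":
                return "reads return data"
    except OSError as exc:
        return "cannot open: " + (exc.strerror or str(exc))
    return None


if __name__ == "__main__":
    print(null_device_problem() or "/dev/null looks sane")
```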

Colin

 Colin Dente                      | JANET: dente@uk.ac.man.ee.els
 Dept. of Electrical Engineering  | ARPA:  dente@els.ee.man.ac.uk 
 University of Manchester         | UUCP:  ...!mcvax!ukc!man.ee.els!dente 
 England                          | These might work now, but then again...

crh@BLUEBIRD.ENG.OHIO-STATE.EDU (Charlotte Hawley) (01/16/90)

I've seen this happen twice.  The first was on one of the first systems
where I installed SR10.1; I rebooted from the tape and
redid the install as an update.  A couple of files were
read in and everything worked fine.

A friend of mine just had this problem - it seems Apollo had
the wrong disk controller.  He was going from 10.1 to 9.7
to run some special application software.

krowitz%richter@UMIX.CC.UMICH.EDU (David Krowitz) (01/17/90)

Typing the command "stcode 3040001" gets me:

cleanup handler released out of order (process manager/process fault manager)

Sounds like your SR10 software installation has been mucked up ... probably
some of the stuff in /sys/node_data. One of the likely candidates is the
/sys/boot_shell file (it's the one *I* always trash!). You can copy a new version
from another node.
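`stcode` turns the number into text; the sketch below only shows how such a number might split into fields. It assumes the 32-bit layout usually given for the Domain/OS `status_$t` type (1 fail bit, 7-bit subsystem, 8-bit module code, 16-bit error code) and that the value printed by pm_$init is hexadecimal. Both are assumptions, and `DomainStatus`/`decode_status` are hypothetical names. Read that way, 3040001 is subsystem 3, module 4, code 1, which at least fits stcode naming a subsystem/module pair.

```python
from typing import NamedTuple


class DomainStatus(NamedTuple):
    fail: int    # 1 bit
    subsys: int  # 7 bits: subsystem that reported the error
    modc: int    # 8 bits: module within that subsystem
    code: int    # 16 bits: error code within the module


def decode_status(text: str) -> DomainStatus:
    """Split a status printed as hex (e.g. "3040001") into its fields."""
    value = int(text, 16) & 0xFFFFFFFF
    return DomainStatus(
        fail=(value >> 31) & 0x1,
        subsys=(value >> 24) & 0x7F,
        modc=(value >> 16) & 0xFF,
        code=value & 0xFFFF,
    )


if __name__ == "__main__":
    print(decode_status("3040001"))
    # DomainStatus(fail=0, subsys=3, modc=4, code=1)
```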


 -- David Krowitz

krowitz@richter.mit.edu   (18.83.0.109)
krowitz%richter.mit.edu@eddie.mit.edu
krowitz%richter.mit.edu@mitvma.bitnet
(in order of decreasing preference)

ross@cancol.oz (Ross Johnson) (01/17/90)

In article <1990Jan12.172440.851@me.toronto.edu>, sun@me.utoronto.ca (Andy Sun Anu-guest) writes:
> A strange thing has happened to an Apollo DN3500, something I've never
> seen before, and I can't seem to find a cure for it. We've phoned up Apollo

The same thing has happened to me twice on a DN4500 running 10.1 (including
10.1.2 I think). I believe I fixed the problem each time by copying /sys/dm/dm
from another node and rebooting. I say "believe" because as far as I could
tell, the old and new dm's were the same in all respects (maybe it just
needs some attention now and then :-)
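To settle whether a replaced /sys/dm/dm really matches the copy it came from, one could compare size, a content hash and the Unix permission bits. This is a sketch under the assumption that both copies are readable as ordinary files; Domain/OS objects may also carry attributes (such as ACLs) that a Unix-style stat does not show, so "identical" here is not proof that nothing differs. `describe_differences` and `sha256_of` are hypothetical names.

```python
import hashlib
import os
import stat
import sys
from typing import List


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Hex SHA-256 of a file, read in chunks so large binaries are fine."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_differences(a: str, b: str) -> List[str]:
    """Return which of size, contents, permissions differ between two files."""
    sa, sb = os.stat(a), os.stat(b)
    diffs = []
    if sa.st_size != sb.st_size:
        diffs.append("size")
    if sha256_of(a) != sha256_of(b):
        diffs.append("contents")
    if stat.S_IMODE(sa.st_mode) != stat.S_IMODE(sb.st_mode):
        diffs.append("permissions")
    return diffs


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python same.py /mnt/bad/sys/dm/dm /mnt/good/sys/dm/dm
    print(describe_differences(sys.argv[1], sys.argv[2]) or "identical")
```

Recording `sha256_of` for each system binary on a healthy node would also give a baseline to compare against the next time a node misbehaves.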

This is the only machine this has happened to (we have a couple of DN3x00
machines in heavy use as well).

The 4500 is a routing node and a server for 8 PC's running PCI.

Hope this helps.

+----------------------+---+
| Ross Johnson	       |:-)|  ACSnet: ross@cancol.oz.au
| Info. Sciences & Eng.|___|  ARPA:   ross%cancol.oz.au@uunet.uu.net
| Uni. of Canberra         |  UUCP:   {uunet,ukc}!munnari!cancol.oz.au!ross
| P.O. Box 1               |  CSNET:  ross%cancol.oz@australia
| Belconnen  A.C.T. 2616   |  JANET:  ross%au.oz.cancol@EAN-RELAY
| AUSTRALIA                |  BITNET: ross%cancol.oz.au@relay.cs.net
+--------------------------+