[comp.sys.apollo] More problems with SR10.1 on 2nd disk

bala@synopsys.synopsys.com (Bala Vasireddi) (06/01/90)

We are having some more problems with our DN4500 (SR10.1) again. The 
OS paging problems are solved, thanks to various responses from a lot of folks.

Please bear with me if this is a very trivial problem. I am an Apollo neophyte.

Problem:
--------

The DN4500 boots off the 2nd disk running SR10.1 (the 1st disk has SR10.2
loaded on it). During the first couple of minutes, I can see all the Apollos 
on the network by running 'lcnode'. I can also list/access all the files on 
another Apollo ("kings") by doing 'ls //kings'. This seems to work only for 
2 or 3 tries.  Around the 3rd or 4th try, everything freezes up (and
the machine becomes virtually unusable). After what seems an eternity
the 'ls //kings' command returns with output like:

file XXXX not found
file YYYY not found
etc. etc.

At this point doing an 'lcnode' also shows that, this node is not seeing
any other Apollo node on the network. Or for that matter trying to 'rlogin' 
into this machine from any of our Suns (or vice-versa) doesn't work.

Can somebody please point me as to what is wrong with this machine?

If it helps, I used the following procedure to set up this machine:

1. invol'ed the 2nd disk using the SR10.2 invol
2. Installed SR10.1 from AA on the 1st disk using the following commands:

      config -a AA -c configuration_file  (to create config file)
      install -pvx -c configuration_file -s AA target (install the SR10.1)

3. The Master registry server is the DN10k ("kings") running SR10.2
4. Loaded NFS 2.0 on this node to mount filesystems from our Suns.

What am I overlooking? Any help is appreciated.

Thanks

-------
Bala Vasireddi,            Phone: (415)962-5036
Synopsys, Inc.             FAX:   (415)965-8637
1098 Alta Ave              DDN:   bala@synopsys.com
Mountain View, CA 94043    UUCP:  ..!fernwood.mpk.ca.us!synopsys!bala

thompson@PAN.SSEC.HONEYWELL.COM (John Thompson) (06/01/90)

> Problem:
> --------
> The DN4500 boots off the 2nd disk running SR10.1 (the 1st disk has SR10.2
> loaded on it). During the first couple of minutes, I can see all the Apollos 
> on the network by running 'lcnode'. I can also list/access all the files on 
> another Apollo ("kings") by doing 'ls //kings'. This seems to work only for 
> 2 or 3 tries.  Around the 3rd or 4th try, everything freezes up (and
> the machine becomes virtually unusable). After what seems an eternity
> the 'ls //kings' command returns with output like:
> 
> file XXXX not found
> file YYYY not found
> etc. etc.
> 
> At this point doing an 'lcnode' also shows that, this node is not seeing
> any other Apollo node on the network. Or for that matter trying to 'rlogin' 
> into this machine from any of our Suns (or vice-versa) doesn't work.
> 
> Can somebody please point me as to what is wrong with this machine?

It sounds to me like somebody on your ethernet / token-ring is running rtsvc 
with a non-zero network ID (note: NOT tcp network number).  This is normally 
used when you set up a domain internet (aka transparent domain).  It provides 
the Domain Distributed System (DDS) across a fast internet (Full T1 speed or 
greater is the 'supported' speed -- we've done it at < 56K baud, I believe).  

At any rate, if you do a "rtsvc" on your various nodes, you'll find something
like 
.     $ rtsvc
.     
.         Controller        Net ID     Service offered
.     ==================   ========   ====================
.     RING                    28124   Own traffic only
.     ETH802.3_AT                 0   Port not open
.
My node has 2 controllers (it used to be a DDS router node).  The ring is
the only one we use (for DDS) now.  Its Network ID is 28124 (not really;
for security reasons I changed it from what it REALLY is).  If we had
routing turned on, the service offered would be "Internet routing."  I'd
guess that at least one node on your net has the net ID set, and is
broadcasting it.  Domain nodes figure out what DDS net they're in by 
using the "hint_file" in `node_data when they boot up.  If they hear
somebody broadcasting a different network, they update themselves after
a short time (15 minutes?), unless they are router nodes (routing nodes are
the only ones that broadcast net numbers).  If you have a couple of nodes that
are set up with routing enabled (even if they only have 1 controller), your
nodes will eventually get confused if the net addresses conflict.

You can fix this by correcting the nodes with conflicting net numbers.  The
rtsvc command is located in /com (Aegis) and in /etc (since you appear to be
a Unix house).

Now that I've spoken authoritatively on the subject, let me say that it doesn't
necessarily explain everything.  I would expect that you'd see at least ONE
other node on the Apollo network (lcnode), because SOMEBODY else would have
the same network ID.  rlogin (and all TCP/IP services) should continue to
work, unless you have a file (/etc/hosts, for instance) linked over to a
non-communicating node (e.g. //kings).


Good luck!
John Thompson
Honeywell, SSEC
Plymouth, MN  55441
thompson@pan.ssec.honeywell.com
thompson@animal.ssec.honeywell.com

Don't blame Honeywell for my opinions.
Any address corruptions caused by the mailer should be sent to /dev/mentor -- 
working with their mail system has ruined sendmail and my sanity.       :-(

thompson@PAN.SSEC.HONEYWELL.COM (John Thompson) (06/02/90)

Netlanders --
    Sorry about the partial mail message.  I accidentally used "." as an
indent character, and ended up with "." at the start of a line all by itself.
<sigh>
Here's the message I _meant_ to send out:

> Problem:
> --------
> The DN4500 boots off the 2nd disk running SR10.1 (the 1st disk has SR10.2
> loaded on it). During the first couple of minutes, I can see all the Apollos 
> on the network by running 'lcnode'. I can also list/access all the files on 
> another Apollo ("kings") by doing 'ls //kings'. This seems to work only for 
> 2 or 3 tries.  Around the 3rd or 4th try, everything freezes up (and
> the machine becomes virtually unusable). After what seems an eternity
> the 'ls //kings' command returns with output like:
> 
> file XXXX not found
> file YYYY not found
> etc. etc.
> 
> At this point doing an 'lcnode' also shows that, this node is not seeing
> any other Apollo node on the network. Or for that matter trying to 'rlogin' 
> into this machine from any of our Suns (or vice-versa) doesn't work.
> 
> Can somebody please point me as to what is wrong with this machine?

It sounds to me like somebody on your ethernet / token-ring is running rtsvc 
with a non-zero network ID (note: NOT tcp network number).  This is normally 
used when you set up a domain internet (aka transparent domain).  It provides 
the Domain Distributed System (DDS) across a fast internet (Full T1 speed or 
greater is the 'supported' speed -- we've done it at < 56K baud, I believe).  

At any rate, if you do a "rtsvc" on your various nodes, you'll find something
like 
:     $ rtsvc
:     
:         Controller        Net ID     Service offered
:     ==================   ========   ====================
:     RING                    28124   Own traffic only
:     ETH802.3_AT                 0   Port not open
:
Note that our token ring is network 28124 (not REALLY -- for security reasons 
I don't broadcast the real DDS network).  We used to have the ethernet running
DDS services too, but no longer need it (a sister division closed).  If routing
is enabled, you'll see a service of "Internet router".  Any node with that
service will broadcast the network that it believes is true.  Any node that
doesn't offer routing will listen for those packets, and after a short time 
(15 minutes?), will modify its own network if necessary.  The initial network
that a node uses is stored (non-readable) in `node_data/hint_file, if I
remember right.  It sounds to me like you have 2 or more nodes broadcasting 
different net numbers.  If the DDS net doesn't match, nodes won't talk to each
other except through a router node (2 controllers, each offering routing).
If your /etc/hosts table (or other tcp info) is linked from your DN4500 off to
another node, you won't be able to locate it, and rlogin will fail.

If this is the correct cause of the problem, just check all the nodes that
are physically connected to your ethernet / token-ring ('rtsvc' is in /com
and /etc), and you should find conflicting network numbers.  Correct them
by using 'rtsvc -dev <DEVICE> -net <ID>', and you should be ok.  It might
be necessary to disconnect some nodes from the network before doing this, as
they may become so brain-dead that they can't cope with the rest of the nodes.
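
If you have a lot of nodes to check, the comparison can be scripted.  What
follows is only a rough sketch in Python, not anything supplied with Domain/OS.
It assumes you have already saved the text printed by 'rtsvc' on each node, and
that the output has the three-column layout shown above.  The node names
(node_a, node_b), the second sample with net ID 12345, and the function names
are made up for illustration.  It also assumes that every node reporting the
same controller name sits on the same physical network, which may not be true
at your site.

```python
import re
from collections import defaultdict

# Matches data rows such as "RING                    28124   Own traffic only".
# The layout is assumed from the sample rtsvc output quoted in this thread.
ROW = re.compile(r"^\s*(\S+)\s+(\d+)\s+(\S.*?)\s*$")

# Hypothetical captured output; node_a mirrors the sample in the post.
SAMPLE_A = """\
    Controller        Net ID     Service offered
==================   ========   ====================
RING                    28124   Own traffic only
ETH802.3_AT                 0   Port not open
"""

# Invented second node that advertises a different net ID on the ring.
SAMPLE_B = """\
    Controller        Net ID     Service offered
==================   ========   ====================
RING                    12345   Internet router
ETH802.3_AT                 0   Port not open
"""


def parse_rtsvc(text):
    """Return (controller, net_id, service) tuples from rtsvc-style text."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:  # header and separator lines do not match and are skipped
            rows.append((match.group(1), int(match.group(2)), match.group(3)))
    return rows


def find_conflicts(outputs):
    """Map controller -> {net_id: [nodes]} wherever the net IDs disagree.

    outputs is a dict of {node_name: saved rtsvc text}.  Ports reported as
    "Port not open" are ignored, since they carry no Domain traffic.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for node, text in outputs.items():
        for controller, net_id, service in parse_rtsvc(text):
            if service != "Port not open":
                seen[controller][net_id].append(node)
    return {ctl: dict(ids) for ctl, ids in seen.items() if len(ids) > 1}


if __name__ == "__main__":
    report = find_conflicts({"node_a": SAMPLE_A, "node_b": SAMPLE_B})
    for controller, ids in report.items():
        print(controller, "has conflicting net IDs:", ids)
```

Any controller that shows up in the report has nodes disagreeing about the
net ID, and those are the ones to fix with the rtsvc command above.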

Good Luck!
John Thompson
Honeywell, SSEC
thompson@pan.ssec.honeywell.com
thompson@animal.ssec.honeywell.com
thompson%pan.ssec.honeywell.com@cim-vax.honeywell.com

Don't blame anyone but me for my opinions.  (Well, maybe my parents are responsible).