[comp.sys.apollo] more 'failing update SR10.1.p->SR10.2.p'

schmid@asterix.luftfahrt.uni-stuttgart.de (Georg Schmid) (07/10/90)

Because I received mail from several people who have the same problems 
running SR10.2.p on their DN10000s, I want to post in more detail 
what I have found out since my last posting.
(Sorry, this article is quite long.)

After installing the OS (invol, diskless on another DN10000, install++) 
and configuring /etc/rc.local, the DN10000 hangs while starting tcpd 
and routed at boot, after it had gone through a few (successful)
shutdowns.

That is, it seems to stop at "Starting standard daemons: tcpd routed"
for about 20 minutes before reaching the dm (in fact everything is 
only going slowly).
Once it reaches the dm, any use of programs like 'ping', 'telnet' or
'netstat' causes the node to hang for several minutes, followed by an
error message like 'router/udp: unknown service' or 'icmp: unknown 
protocol'. 
If a dm command is issued during the hang, the cursor disappears and 
nothing else happens for some minutes.
Other people told me that their machines don't reach the dm at all.

The node boots correctly without the internet processes (I just removed 
tcpd, routed, inetd from '/etc/daemons').
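
For anyone who wants to repeat this test, here is a minimal sketch of how 
the three entries could be moved aside instead of deleted, so they are easy 
to put back. It assumes that /etc/daemons is a directory in which each 
daemon is enabled by a file of the same name; the function name and the 
".disabled" directory are hypothetical choices, not system conventions.

```shell
#!/bin/sh
# Sketch: disable the internet daemons by moving their entries out of the
# daemons directory. ASSUMPTION: each daemon is enabled by a file of the
# same name in that directory. The ".disabled" directory is hypothetical.
disable_inet_daemons() {
    daemons_dir=${1:-/etc/daemons}
    saved_dir=${2:-$daemons_dir.disabled}

    # Create the holding directory if it is not there yet.
    [ -d "$saved_dir" ] || mkdir "$saved_dir" || return 1

    for d in tcpd routed inetd; do
        if [ -f "$daemons_dir/$d" ]; then
            mv "$daemons_dir/$d" "$saved_dir/$d" || return 1
        fi
    done
}

# Example call (not executed here):
#   disable_inet_daemons /etc/daemons
```

Moving the files back from the holding directory should re-enable the 
daemons on the next boot, under the same assumption.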

Our second DN10000 (240F3) has been working for several months with 
SR10.2.p and the internet services. (I just had to issue things like 
'mkdev /dev pty' from time to time)

The reason it doesn't boot seems to lie in the directory
/sys/node_data/systmp, probably in the file 'tcp_data':

  - When I removed the directory 'systmp' from the Phase II shell
    and copied the directory //other-dn10000/sys/node_data.1b19c/systmp 
    to node 1b19c, it booted successfully and the internet services
    worked fine, but only for a few shutdowns. (I used our other DN10000 
    as a partner node while installing 1b19c, because the other
    DN10000 holds our AA for SR10.2.p.)

  - When I pressed Ctrl-Return in service mode while the node was hanging
    at "Starting standard....", the subsequent salvol told me:
    'the vtoc trouble flag has been set for the following objects:
                  /sys/node_data/systmp/tcp_data'
    and sometimes:
    'the vtoc trouble flag has been set for the following objects:
                  /sys/node_data/systmp/global_readonly'

  - When I looked at the file tcp_data (after the salvol) from the Phase II
    shell, its size was 528384 bytes, whereas its normal size seems to be 
    1216512 bytes. Whenever the file has the former size, the internet 
    services seem to fail.
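
To compare other nodes against these two sizes, a small check like the 
following might help. It is only a sketch in POSIX sh: the function name 
is made up, and the two sizes are simply the ones observed on node 1b19c, 
so they may well differ on other configurations.

```shell
#!/bin/sh
# Sizes observed in this posting; whether they hold elsewhere is unknown.
BAD_SIZE=528384
GOOD_SIZE=1216512

# Prints "bad", "good", "other" or "missing" for the given file.
# classify_tcp_data is a hypothetical helper, not a system command.
classify_tcp_data() {
    f=${1:-/sys/node_data/systmp/tcp_data}
    if [ ! -f "$f" ]; then
        echo missing
        return 0
    fi
    # wc may pad its output with blanks, so strip them.
    size=$(wc -c < "$f" | tr -d ' ')
    case $size in
        "$BAD_SIZE")  echo bad ;;
        "$GOOD_SIZE") echo good ;;
        *)            echo other ;;
    esac
}

# Example call (not executed here):
#   classify_tcp_data /sys/node_data/systmp/tcp_data
```

A result of "bad" would only mean that the file has the size seen on the 
failing node, not that the internet services are certain to fail.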

Because there  m u s t  be some relevant difference, I'm posting the exact
configurations of our machines so they can be compared with machines that
have the same problem:

Node 1B19C
  2 CPU's, VS-Graphics, 16MB, 2x700MB Disks (sector striped) with two
  controllers, 1 Ethernet, 1 Apollo Token Ring controller.
  installed patches: p103, p108, p118, p119, p120, p124, p128, p130

Node 240F3
  4 CPU's, 128MB, 4x700MB Disk (sector striped) with two controllers, 
  2 Apollo Token Ring controllers.
  installed patches: same as above

At the moment I use a modified rc.local which removes the old systmp and 
copies a 'working' systmp to /sys/node_data before starting anything else
in rc.local.
This seems to work for now (including TCP/IP) but doesn't really fix the 
problem; it's just a 'dirty hack' (I know). Some attempts to remove tcp_data
from within rc.local, or to truncate it with cat /dev/null, failed.

The real problem might be somewhere in the filesystem: something that 
corrupts things in systmp at shutdown and so prevents tcpd from reopening 
tcp_data and starting correctly. (That's just a guess, I'm not an expert.)

P.S.: I APR'ed this via email but haven't received any acknowledgement yet;
      perhaps it failed?

-----
Georg Schmid, ISD Uni Stuttgart, W.-Germany     
email: schmid@asterix.luftfahrt.uni-stuttgart.de
voice: 0(049-)711-685-2053
fax:   0(049-)711-685-3706

schmid@asterix.luftfahrt.uni-stuttgart.de (Georg Schmid) (07/10/90)

I just received mail from rees@citi.umich.edu. He says 

> It's a very bad idea to copy systmp from one node to another.  You should be
> able to just remove it and create an empty directory.

I suppose he didn't like my idea of copying systmp around, and I must confess
I didn't like it either (not very much at least), but I was just too stupid 
to find his more elegant solution. So what I do now at the beginning of 
rc.local is: 
   /bin/rm -rf /sys/node_data/systmp
   /bin/mkdir /sys/node_data/systmp

This seems to work!

(I tried to remove '/sys/node_data/systmp/*' before, which didn't work.)



rees@dabo.ifs.umich.edu (Jim Rees) (07/11/90)

In article <186@rusux1.rus.uni-stuttgart.de>,
schmid@asterix.luftfahrt.uni-stuttgart.de (Georg Schmid) writes:
    So what I do now at the beginning of 
    rc.local is: 
       /bin/rm -rf /sys/node_data/systmp
       /bin/mkdir /sys/node_data/systmp

I still think it's a bad idea to do this in rc.local.  I would personally
not try it except just before shutdown.  If you find it necessary to do this
on every boot, then something else is seriously broken and you should try to
track it down and fix it.

If you need an immediate workaround, I would try instead:
       /bin/rm -f /sys/node_data/systmp/tcp_data

schmid@jellosub.luftfahrt.uni-stuttgart.de (Georg Schmid) (07/11/90)

In article <1990Jul10.195225.24128@terminator.cc.umich.edu>,
rees@citi.umich.edu (Jim Rees) writes,

> In article <186@rusux1.rus.uni-stuttgart.de>,
> schmid@asterix.luftfahrt.uni-stuttgart.de (Georg Schmid) writes:
>    So what I do now at the beginning of 
>    rc.local is: 
>       /bin/rm -rf /sys/node_data/systmp
>       /bin/mkdir /sys/node_data/systmp
>
> I still think it's a bad idea to do this in rc.local.  I would personally
> not try it except just before shutdown.  If you find it necessary to do this
> on every boot, then something else is seriously broken and you should try to
> track it down and fix it.
>
> If you need an immediate workaround, I would try instead:
>       /bin/rm -f /sys/node_data/systmp/tcp_data

Well, I had already tried to rm /sys/node_data/systmp/tcp_data (in fact this
was one of my very first experiments), but that didn't help; on the contrary,
it seemed to make the node hang for sure.

When I looked at the /sys/node_data directory in the Authorized Area, I saw
that there is no template for systmp, which means that systmp must be
created for the first time at boot.

I couldn't figure out where this is done for nodes with disks, but for
diskless nodes it's done in /sys/net/netman.bin_sh.

When you look at that file, you can see that for diskless nodes the 
following happens:

 - When /sys/node_data.<diskless_node_id>/systmp doesn't exist on the 
   partner node, the directory systmp is created with mode 777.
 - If systmp does exist, the directory (and its contents) is
   unlocked by /etc/ulkob -f.
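
Written out as a sketch, the two cases look roughly like this. It is not 
the text of /sys/net/netman.bin_sh, only my reading of it. The function 
name and the ULKOB variable are hypothetical (ULKOB exists only so the 
sketch can be tried on a system without /etc/ulkob), and passing the 
directory pathname as the argument after -f is an assumption.

```shell
#!/bin/sh
# Sketch of the two cases described above, NOT the actual netman script.
# ULKOB is a hypothetical override for the unlock command.
ULKOB=${ULKOB:-/etc/ulkob}

prepare_systmp() {
    systmp=$1
    if [ ! -d "$systmp" ]; then
        # Case 1: systmp is missing, so create it with mode 777.
        mkdir "$systmp" && chmod 777 "$systmp"
    else
        # Case 2: systmp exists, so unlock it.
        # ASSUMPTION: the pathname is given after -f; how the contents
        # of the directory are covered is not shown here.
        "$ULKOB" -f "$systmp"
    fi
}

# Example call (not executed here):
#   prepare_systmp /sys/node_data/systmp
```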

When I tried to fix the problem using /etc/ulkob, the node hung again.

Even so, in my opinion it should be perfectly legitimate to remove
systmp and to recreate it with mode 777. 
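
As a sketch, removing and recreating the directory with an explicit mode 
could be wrapped like this. The function name is made up, and the explicit 
chmod is an addition to the two commands I posted earlier; whether this is 
safe to do on every boot is exactly what is disputed in this thread.

```shell
#!/bin/sh
# Sketch: remove systmp and recreate it empty with mode 777.
# reset_systmp is a hypothetical helper name.
reset_systmp() {
    systmp=${1:-/sys/node_data/systmp}
    rm -rf "$systmp" || return 1
    mkdir "$systmp" || return 1
    # Set the mode explicitly so the result does not depend on the umask.
    chmod 777 "$systmp"
}

# Example call (not executed here):
#   reset_systmp /sys/node_data/systmp
```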

The reason I do this at the beginning of rc.local (perhaps there
is some nicer/earlier place) is that I want it to be done 
automatically on every boot. This is necessary because some people work 
with I-DEAS and Domain Phigs on that machine, and it is rather 
unpredictable when the next 'shutdown' will happen.

After all, I'm happy with it working this way, and I don't know how to
track down the error any further (I have already spent a lot of time
waiting for salvol to complete). 
I do think that something is seriously broken, but I also think it's 
not my job to fix it; it's the job of HP/Apollo. (The machine was 
installed completely from scratch, including invol etc., so I guess it's
not my fault that it didn't work.)

Thanks to all who helped me with their hints.
 
-----
Georg Schmid, ISD University of  Stuttgart, W.-Germany     
email: schmid@asterix.luftfahrt.uni-stuttgart.de (129.69.110.2)
voice: 0(049-)711-685-2053
fax:   0(049-)711-685-3706