schmid@asterix.luftfahrt.uni-stuttgart.de (Georg Schmid) (07/10/90)
Because I received mail from several people who have the same problems running SR10.2.p on their DN10000s, I want to post in more detail what I have found out since my last posting. (Sorry, this article is quite long.)

After installing the OS (invol, diskless on other DN10000, install++) and configuring /etc/rc.local, the DN10000 hangs while starting tcpd and routed, after having gone through a few successful shutdowns first. That is, it seems to stop at "Starting standard daemons: tcpd routed" for about 20 minutes before reaching the dm (strictly speaking it is not hung, everything is just going slowly). Once it reaches the dm, any use of programs like 'ping', 'telnet' or 'netstat' causes the node to hang for several minutes, followed by an error message like 'router/udp: unknown service' or 'icmp: unknown protocol'. If I issue any dm command while it is hanging, the cursor disappears and nothing else happens for some minutes. Other people told me that their machines don't reach the dm at all.

The node boots correctly without the internet processes (I just removed tcpd, routed and inetd from '/etc/daemons'). Our second DN10000 (240F3) has been working for several months with SR10.2.p and the internet services. (I just had to issue things like 'mkdev /dev pty' from time to time.)

The reason it doesn't boot seems to lie in the directory /sys/node_data/systmp, probably in the file 'tcp_data':

- When I removed the directory 'systmp' from the Phase II shell and copied the directory //other-dn10000/sys/node_data.1b19c/systmp to node 1b19c, it booted successfully and the internet services worked fine, but only for a few shutdowns. (I used our other DN10000 as a partner node while installing 1b19c, because the other DN10000 holds our AA for SR10.2.p.)

- When I pressed Ctrl-Return in service mode while it was hanging at "Starting standard....", the subsequent salvol told me:
  'the vtoc trouble flag has been set for the following objects: /sys/node_data/systmp/tcp_data'
  and sometimes:
  'the vtoc trouble flag has been set for the following objects: /sys/node_data/systmp/global_readonly'

- When I looked at the file tcp_data (after the salvol) from the Phase II shell, its size was 528384 bytes, whereas its normal size seems to be 1216512 bytes. Every time the file has the former size, the internet services seem to fail.

Because there m u s t be some relevant difference, I'm posting the exact configurations of our machines so they can be compared with machines having the same problem:

Node 1B19C
  2 CPUs, VS-Graphics, 16MB, 2x700MB disks (sector striped) with two controllers, 1 Ethernet, 1 Apollo Token Ring controller.
  Installed patches: p103, p108, p118, p119, p120, p124, p128, p130

Node 240F3
  4 CPUs, 128MB, 4x700MB disks (sector striped) with two controllers, 2 Apollo Token Ring controllers.
  Installed patches: same as above

At the moment I use a modified rc.local which removes the old systmp and copies a 'working' systmp to /sys/node_data before starting anything else in rc.local. This seems to work for now (including TCP/IP) but doesn't really fix the problem; it's just a 'dirty hack' (I know). Some attempts to remove tcp_data within rc.local, or to cat /dev/null onto it, failed.

The real problem might be somewhere in the filesystem: something that corrupts things in systmp at shutdown, thus preventing tcpd from reopening tcp_data and starting correctly. (That's just a guess, I'm not an expert.)

P.S.: I APR'ed this via email but haven't received any acknowledgement yet. Perhaps it failed?

-----
Georg Schmid, ISD Uni Stuttgart, W.-Germany
email: schmid@asterix.luftfahrt.uni-stuttgart.de
voice: 0(049-)711-685-2053
fax: 0(049-)711-685-3706
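For anyone who wants to compare their own node against the sizes reported above, the check can be sketched as a small shell function. This is only an illustration, not something from the original posting: the function name `tcp_data_state` and its output words are made up here, the two sizes are simply the values observed on node 1b19c, and a POSIX-style /bin/sh with shell functions is assumed.

```shell
# Sizes observed in the posting above; they are observations on one
# node, not documented constants.
BAD_SIZE=528384
GOOD_SIZE=1216512

# tcp_data_state FILE
# (hypothetical helper) prints one of: missing, suspect, normal,
# or "unknown (N bytes)" for any other size.
tcp_data_state() {
    file=$1
    if [ ! -f "$file" ]; then
        echo missing
        return 0
    fi
    # wc -c may pad its output with blanks; the unquoted echo trims them.
    size=`wc -c < "$file"`
    size=`echo $size`
    case $size in
        "$BAD_SIZE")  echo suspect ;;
        "$GOOD_SIZE") echo normal ;;
        *)            echo "unknown ($size bytes)" ;;
    esac
}
```

It would be called as `tcp_data_state /sys/node_data/systmp/tcp_data`, for example from the Phase II shell or early in rc.local. A result of "suspect" only means the file has the size that coincided with the failures described above.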
schmid@asterix.luftfahrt.uni-stuttgart.de (Georg Schmid) (07/10/90)
I just received mail from rees@citi.umich.edu. He says:

> It's a very bad idea to copy systmp from one node to another. You should be
> able to just remove it and create an empty directory.

I suppose he didn't like my idea of copying systmp around, and I must confess I didn't like it either (not very much, at least), but I was just too stupid to find his more elegant solution. So what I do now at the beginning of rc.local is:

    /bin/rm -rf /sys/node_data/systmp
    /bin/mkdir /sys/node_data/systmp

This seems to work! (I tried removing '/sys/node_data/systmp/*' before, which didn't work.)

-----
Georg Schmid, ISD University of Stuttgart, W.-Germany
email: schmid@asterix.luftfahrt.uni-stuttgart.de
voice: 0(049-)711-685-2053
fax: 0(049-)711-685-3706
rees@dabo.ifs.umich.edu (Jim Rees) (07/11/90)
In article <186@rusux1.rus.uni-stuttgart.de>, schmid@asterix.luftfahrt.uni-stuttgart.de (Georg Schmid) writes:

> So what I do now at the beginning of rc.local is:
>     /bin/rm -rf /sys/node_data/systmp
>     /bin/mkdir /sys/node_data/systmp

I still think it's a bad idea to do this in rc.local. I would personally not try it except just before shutdown. If you find it necessary to do this on every boot, then something else is seriously broken and you should try to track it down and fix it.

If you need an immediate workaround, I would try instead:

    /bin/rm -f /sys/node_data/systmp/tcp_data
schmid@jellosub.luftfahrt.uni-stuttgart.de (Georg Schmid) (07/11/90)
In article <1990Jul10.195225.24128@terminator.cc.umich.edu>, rees@citi.umich.edu (Jim Rees) writes:

> In article <186@rusux1.rus.uni-stuttgart.de>,
> schmid@asterix.luftfahrt.uni-stuttgart.de (Georg Schmid) writes:
>   So what I do now at the beginning of
>   rc.local is:
>     /bin/rm -rf /sys/node_data/systmp
>     /bin/mkdir /sys/node_data/systmp
>
> I still think it's a bad idea to do this in rc.local. I would personally
> not try it except just before shutdown. If you find it necessary to do this
> on every boot, then something else is seriously broken and you should try to
> track it down and fix it.
>
> If you need an immediate workaround, I would try instead:
>     /bin/rm -f /sys/node_data/systmp/tcp_data

Well, I already tried to rm /sys/node_data/systmp/tcp_data (in fact this was one of my very first experiments), but that didn't help. On the contrary, it seemed to make the node hang for certain.

When I looked at the /sys/node_data directory in the Authorized Area, I saw that there is no template for systmp, which means systmp must be created for the first time at boot time. I couldn't figure out where this is done for nodes with disks, but for diskless nodes it's done in /sys/net/netman.bin_sh. When you look at that file, you can see that the following happens for diskless nodes:

- If /sys/node_data.<diskless_node_id>/systmp doesn't exist on the partner node, the directory systmp is created with mode 777.
- If systmp does exist, the directory (and its contents) is unlocked by /etc/ulkob -f.

When I tried to fix the problem using /etc/ulkob, the node hung again.

Nevertheless, in my opinion it should be absolutely legal to remove systmp and to recreate it with mode 777. The reason I do this at the beginning of rc.local (there might be some nicer/earlier place) is that I want it to be done automatically on every boot. This is necessary because some people work with I-DEAS and Domain Phigs on that machine, and it is rather unpredictable when the next 'shutdown' will happen.

After all, I'm happy with it working this way, and I don't know how to track the error down any further (I have already spent a lot of time waiting for salvol to complete). I do think that something is seriously broken, but I also think it's not my job to fix it; it's the job of HP/Apollo. (The machine was installed completely from scratch, including invol etc., so I guess it's not my fault that it didn't work.)

Thanks to all who helped me with their hints.

-----
Georg Schmid, ISD University of Stuttgart, W.-Germany
email: schmid@asterix.luftfahrt.uni-stuttgart.de (129.69.110.2)
voice: 0(049-)711-685-2053
fax: 0(049-)711-685-3706
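The "remove systmp and recreate it with mode 777" step discussed in this thread could be wrapped in a small shell function, roughly as below. This is a sketch under assumptions, not the contents of anyone's actual rc.local: the name `reset_systmp` and the guard against an empty or "/" argument are additions made here for illustration, and a POSIX-style /bin/sh is assumed. It does not address whatever corrupts systmp in the first place.

```shell
# reset_systmp DIR
# (hypothetical helper) removes DIR with everything in it and
# recreates it as an empty directory with mode 777.
reset_systmp() {
    dir=$1
    # Refuse an empty argument or "/" so a typo cannot wipe the root.
    case $dir in
        ''|/)
            echo "reset_systmp: refusing to reset '$dir'" >&2
            return 1
            ;;
    esac
    rm -rf "$dir" || return 1
    mkdir "$dir" || return 1
    # An explicit chmod gives mode 777 regardless of the current umask.
    chmod 777 "$dir"
}
```

In the setup described above it would be called as `reset_systmp /sys/node_data/systmp` at the very beginning of rc.local, before tcpd and routed are started. The posts in this thread use absolute paths such as /bin/rm, which may be the safer choice if PATH is not yet set at that point.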