hanche@imf.unit.no (Harald Hanche-Olsen) (02/26/91)
Does anybody know anything about the protocol used when booting diskless from another node? I am under the impression that the protocol used is rather less than robust, in the sense that it just hangs there with no attempts at recovery in case not everything runs smoothly.

We have some diskless DN3000s booting off a 3500 here, and on the average it takes just one or two attempts to boot. Now, however, one of the 3000s is just impossible to get up and running, though on one occasion we had about 90% of the kernel loaded before it was stuck. Another node is showing similar signs, although with persistence we can usually get it up.

I have reason to believe that the boot process can be disturbed by unrelated traffic on the net. In particular, it appears to be stumped by any ethernet packets addressed to the node: One day I realized, in the middle of the boot, that the Mac on my desk still had a telnet connection open to the 3000, so I closed the window -- presumably causing the Mac to send a `please close this connection' packet to the 3000 -- and promptly the o/s load was halted.

So the question really comes down to this: Should we assume the node needs service, and call in the HPollo service guys, or should we call on the network gurus to look for funny stuff running around our ethernet? We are part of a big campus ethernet, and all kinds of traffic is on it: TCP/IP, DDN, Novell, DECnet, ...

- Harald Hanche-Olsen <hanche@imf.unit.no>
Division of Mathematical Sciences
The Norwegian Institute of Technology
N-7034 Trondheim, NORWAY
rees@pisa.ifs.umich.edu (Jim Rees) (02/26/91)
In article <HANCHE.91Feb25182545@hufsa.imf.unit.no>, hanche@imf.unit.no (Harald Hanche-Olsen) writes:
> Does anybody know anything about the protocol used when booting
> diskless from another node?
The net read routine in the PROM is very simple-minded. It does no timeout/
retransmission. That means that if you have trouble loading netboot, you
have to start over. I don't remember whether netboot (the thing that says
".......loaded xxx bytes") does much better, but you used to be able to get
it to retry by hitting the space bar.
You can also read the stdout from netman (by running it in a pad, for
example) and it may tell you about problems it's having communicating with
the partner.
vinoski@apollo.HP.COM (Stephen Vinoski) (02/28/91)
In article <50081142.1bc5b@pisa.ifs.umich.edu> rees@citi.umich.edu (Jim Rees) writes:
>In article <HANCHE.91Feb25182545@hufsa.imf.unit.no>, hanche@imf.unit.no (Harald Hanche-Olsen) writes:
>> Does anybody know anything about the protocol used when booting
>> diskless from another node?
>The net read routine in the PROM is very simple-minded. It does no timeout/
>retransmission. That means that if you have trouble loading netboot, you
>have to start over. I don't remember whether netboot (the thing that says
>".......loaded xxx bytes") does much better, but you used to be able to get
>it to retry by hitting the space bar.

Jim is right (of course :-), the PROM network routines are quite dumb. Harald mentioned that any non-boot packets sent to the diskless node seem to cause trouble; this is because at the PROM level, there are no such things as separate sockets - the PROM believes every packet is intended for it. A non-boot packet would contain the wrong format and could cause the trouble he is seeing.

One thing that even Jim might not be aware of is that netboot was augmented back around the sr10 timeframe to keep track of the boot pages received and retry for the ones it didn't get. We found that a surprising number of pages were being dropped and had to add this checking. The retry mechanism is not real smart but it appears to work. The boot programs have to fit into very limited memory space so more robust checking is not possible.

>You can also read the stdout from netman (by running it in a pad, for
>example) and it may tell you about problems it's having communicating with
>the partner.

I would recommend this. Like I said above, netboot has to run on bare hardware in a very tight memory space, so it doesn't handle errors too gracefully. The netman program, however, has the full power of the OS underneath it, so it can be a little more verbose about what it's doing.
-steve

| Steve Vinoski (508)256-0176 x5904 | Internet: vinoski@apollo.hp.com |
| HP Apollo Division, Chelmsford, MA 01824 | UUCP: ...!apollo!vinoski |
| "The price of knowledge is learning how little of it you yourself harbor." |
| - Tom Christiansen |
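The page tracking described in the post above (remember which boot pages arrived, then ask again for the ones that did not) can be sketched roughly as follows. This is only an illustration in Python, not the netboot code: the names `load_with_retry`, `fetch_page`, `make_lossy_fetch` and the `max_rounds` limit are all hypothetical, and the real program runs on bare hardware in far less memory.

```python
def load_with_retry(total_pages, fetch_page, max_rounds=5):
    """Request every page, then re-request only the pages that never arrived.

    fetch_page(page) is a hypothetical callback that returns the page data,
    or None when the reply was dropped on the network.
    """
    received = {}
    missing = list(range(total_pages))
    for _ in range(max_rounds):
        if not missing:
            break
        still_missing = []
        for page in missing:
            data = fetch_page(page)
            if data is None:
                still_missing.append(page)   # remember it for the next round
            else:
                received[page] = data
        missing = still_missing
    return received, missing


def make_lossy_fetch(drop_once):
    """Simulated boot partner that drops the first reply for the given pages."""
    already_dropped = set()

    def fetch(page):
        if page in drop_once and page not in already_dropped:
            already_dropped.add(page)
            return None                      # simulate a lost packet
        return f"page-{page}".encode()

    return fetch


if __name__ == "__main__":
    received, missing = load_with_retry(8, make_lossy_fetch({2, 5}))
    print(sorted(received), missing)
```

A loader with no bookkeeping of this kind simply waits forever for the page that was dropped, which matches the hang described at the PROM level.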
hanche@imf.unit.no (Harald Hanche-Olsen) (03/01/91)
In article <5011bdff.20b6d@apollo.HP.COM> vinoski@apollo.HP.COM (Stephen Vinoski) writes:
> In article <50081142.1bc5b@pisa.ifs.umich.edu> rees@citi.umich.edu (Jim Rees) writes:
> >In article <HANCHE.91Feb25182545@hufsa.imf.unit.no>, hanche@imf.unit.no (Harald Hanche-Olsen) writes:
> >> Does anybody know anything about the protocol used when booting
> >> diskless from another node?
> >The net read routine in the PROM is very simple-minded. It does no timeout/
> >retransmission. That means that if you have trouble loading netboot, you
> >have to start over. I don't remember whether netboot (the thing that says
> >".......loaded xxx bytes") does much better, but you used to be able to get
> >it to retry by hitting the space bar.

I tried the space bar and it doesn't help.

> >You can also read the stdout from netman (by running it in a pad, for
> >example) and it may tell you about problems it's having communicating with
> >the partner.
>
> I would recommend this. Like I said above, netboot has to run on bare
> hardware in a very tight memory space, so it doesn't handle errors too
> gracefully. The netman program, however, has the full power of the OS
> underneath it, so it can be a little more verbose about what it's doing.

I tried this, with the following result:

# /sys/net/netman
NETMAN -- User level network server -- 1990/05/17

----- Message received from node 9952 on 2/26/1991 at 5:03:54 PM.
--- Sysboot Request from Node 9952
Rqst for file: "netboot".

----- Message received from node 9952 on 2/26/1991 at 5:03:55 PM.
--- Load Range Request --- DOMAIN_OS (0:15)
Rqst for file: "//mummi/sau8/DOMAIN_OS".

----- Message received from node 9952 on 2/26/1991 at 5:03:55 PM.
--- Load Range Request --- DOMAIN_OS (16:31)
Rqst for file: "//mummi/sau8/DOMAIN_OS".

( intervening messages excruciatingly boring, hence omitted )

----- Message received from node 9952 on 2/26/1991 at 5:04:03 PM.
--- Load Range Request --- DOMAIN_OS (160:175)
Rqst for file: "//mummi/sau8/DOMAIN_OS".

====

And here it stopped.
Meanwhile, on the screen of node 9952, something like

... 00004000 BYTES LOADED.
... 00008000, ... C000, 10000, 14000, 18000, 1C000, 20000, 24000,
... 00028000 BYTES LOADED.
(Stuck at beginning of next line)

At this point I tried Jim's suggestion, and hit the space bar. No response on either node.

I did some calculation on the above numbers: Netman reports the load range in kilobytes, in decimal. After the next-to-last request (144:159) has been honored, we have loaded 160KB, or 16#28000 bytes, which is the last figure reported on 9952's screen. Then more pages are requested, but not one of them is received.

Well, thanks for the help. I guess it's time to give the hardware a good shakedown...

- Harald Hanche-Olsen <hanche@imf.unit.no>
Division of Mathematical Sciences
The Norwegian Institute of Technology
N-7034 Trondheim, NORWAY
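The arithmetic in the post above can be checked in a few lines of Python. The sketch assumes, as the poster infers, that netman's load ranges are inclusive and counted in units of 1024 bytes; the function names are made up for illustration.

```python
KB = 1024  # assumed unit of netman's "Load Range Request" numbers


def range_size_bytes(first_kb, last_kb):
    """Size in bytes of one inclusive load range such as (0:15)."""
    return (last_kb - first_kb + 1) * KB


def bytes_loaded_after(last_kb):
    """Total bytes loaded once the range ending at last_kb has arrived."""
    return (last_kb + 1) * KB


# Each request covers 16 KB, which matches the 16#4000 steps on the screen.
print(hex(range_size_bytes(0, 15)))    # 0x4000

# After range (144:159) the node should show 16#28000 bytes loaded.
print(hex(bytes_loaded_after(159)))    # 0x28000
```

Under those assumptions the counter on the node's screen agrees with netman's log up to (144:159), so the reply to the (160:175) request is the first one that never shows up on the node.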
krowitz@RICHTER.MIT.EDU (David Krowitz) (03/01/91)
You know, your problem is beginning to look familiar ...

We have several diskless machines here at MIT which are booted off of a variety of nodes. A little while ago, our AA machine had its disk drive go up in smoke. Suddenly, none of the diskless machines would boot off of their partners *except* for the machines whose partners had a full OS load (i.e., *NONE* of SR10.2 on the partner machine was installed as a link to the AA machine). I searched and searched for what could conceivably be the missing file and didn't find anything ... yet, when I re-installed the OS to point to the new AA machine, the diskless nodes could boot again.

-- David Krowitz

krowitz@richter.mit.edu (18.83.0.109)
krowitz%richter.mit.edu@eddie.mit.edu
krowitz%richter.mit.edu@mitvma.bitnet
(in order of decreasing preference)