[comp.sys.apollo] Problems booting diskless on Ethernet

hanche@imf.unit.no (Harald Hanche-Olsen) (02/26/91)

Does anybody know anything about the protocol used when booting
diskless from another node?  I am under the impression that the
protocol used is rather less than robust, in the sense that it just
hangs there with no attempts at recovery in case not everything runs
smoothly.  We have some diskless DN3000s booting off a 3500 here, and
on the average it takes just one or two attempts to boot.  Now,
however, one of the 3000s is just impossible to get up and running,
though on one occasion we had about 90% of the kernel loaded before it
got stuck.  Another node is showing similar signs, although with
persistence we can usually get it up.

I have reason to believe that the boot process can be disturbed by
unrelated traffic on the net.  In particular, it appears to be stumped
by any ethernet packets addressed to the node:  One day I realized, in
the middle of the boot, that the Mac on my desk still had a telnet
connection open to the 3000, so I closed the window -- presumably
causing the Mac to send a `please close this connection' packet to the
3000 -- and promptly the o/s load was halted.

So the question really comes down to this:  Should we assume the node
needs service, and call in the HPollo service guys, or should we call
on the network gurus to look for funny stuff running around our
ethernet?  We are part of a big campus ethernet, and all kinds of
traffic is on it:  TCP/IP, DDN, Novell, DECnet, ...

- Harald Hanche-Olsen <hanche@imf.unit.no>
  Division of Mathematical Sciences
  The Norwegian Institute of Technology
  N-7034 Trondheim, NORWAY

rees@pisa.ifs.umich.edu (Jim Rees) (02/26/91)

In article <HANCHE.91Feb25182545@hufsa.imf.unit.no>, hanche@imf.unit.no (Harald Hanche-Olsen) writes:

  Does anybody know anything about the protocol used when booting
  diskless from another node?

The net read routine in the PROM is very simple-minded.  It does no timeout/
retransmission.  That means that if you have trouble loading netboot, you
have to start over.  I don't remember whether netboot (the thing that says
".......loaded xxx bytes") does much better, but you used to be able to get
it to retry by hitting the space bar.

You can also read the stdout from netman (by running it in a pad, for
example) and it may tell you about problems it's having communicating with
the partner.

vinoski@apollo.HP.COM (Stephen Vinoski) (02/28/91)

In article <50081142.1bc5b@pisa.ifs.umich.edu> rees@citi.umich.edu (Jim Rees) writes:
>In article <HANCHE.91Feb25182545@hufsa.imf.unit.no>, hanche@imf.unit.no (Harald Hanche-Olsen) writes:
>  Does anybody know anything about the protocol used when booting
>  diskless from another node?
>The net read routine in the PROM is very simple-minded.  It does no timeout/
>retransmission.  That means that if you have trouble loading netboot, you
>have to start over.  I don't remember whether netboot (the thing that says
>".......loaded xxx bytes") does much better, but you used to be able to get
>it to retry by hitting the space bar.

Jim is right (of course :-), the PROM network routines are quite dumb.  Harald
mentioned that any non-boot packets sent to the diskless node seem to cause
trouble; this is because at the PROM level, there are no such things as separate
sockets - the PROM believes every packet is intended for it.  A non-boot packet
would contain the wrong format and could cause the trouble he is seeing.
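
To illustrate the difference between taking every packet as boot data and
checking it first, here is a rough sketch.  The two-byte BOOT_TAG marker and
the frame layout are made up for the example; the real PROM packet format is
not described in this thread.

```python
# Hypothetical marker at the start of a boot reply.  This is an assumption
# for illustration only, not the real Apollo boot packet format.
BOOT_TAG = b"\xb0\x07"


def naive_accept(frame: bytes) -> bool:
    """Behave as the PROM is described: every frame is assumed to be ours."""
    return True


def checked_accept(frame: bytes) -> bool:
    """Accept a frame only if it starts with the (hypothetical) boot marker."""
    return frame.startswith(BOOT_TAG)


boot_frame = BOOT_TAG + b"page data"
stray_frame = b"\x00\x17telnet close"  # unrelated traffic addressed to the node

for name, frame in (("boot", boot_frame), ("stray", stray_frame)):
    print(name, "naive:", naive_accept(frame), "checked:", checked_accept(frame))
```

A loader of the naive kind would try to interpret the stray frame as boot
data, which is one way a closing telnet connection could upset the load.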

One thing that even Jim might not be aware of is that netboot was augmented back
around the sr10 timeframe to keep track of the boot pages received and retry for
the ones it didn't get.  We found that a surprising number of pages were being
dropped and had to add this checking.  The retry mechanism is not real smart but
it appears to work.  The boot programs have to fit into very limited memory
space so more robust checking is not possible.
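
The general idea of keeping track of pages and retrying the missing ones can
be sketched as follows.  This is not the netboot code; load_pages, fetch and
max_rounds are names invented for the sketch, and the real program works
under much tighter constraints.

```python
from typing import Callable, List, Optional


def load_pages(total_pages: int,
               fetch: Callable[[int], Optional[bytes]],
               max_rounds: int = 5) -> List[bytes]:
    """Request every page, then re-request only the pages that never arrived.

    fetch(page) returns the page contents, or None if the reply was lost.
    """
    received = {}
    missing = list(range(total_pages))
    for _ in range(max_rounds):
        if not missing:
            break
        still_missing = []
        for page in missing:
            data = fetch(page)
            if data is None:
                still_missing.append(page)   # remember it for the next round
            else:
                received[page] = data
        missing = still_missing
    if missing:
        raise RuntimeError("pages never received: %r" % missing)
    return [received[page] for page in range(total_pages)]


# Usage: a simulated partner that drops the first reply for pages 2 and 5.
dropped_once = {2, 5}


def flaky_fetch(page: int) -> Optional[bytes]:
    if page in dropped_once:
        dropped_once.discard(page)
        return None
    return bytes([page])


pages = load_pages(8, flaky_fetch)
print(len(pages), "pages loaded")
```

The sketch gives up after a fixed number of rounds rather than hanging, which
is a choice made here for the example and not a claim about what netboot does.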

>You can also read the stdout from netman (by running it in a pad, for
>example) and it may tell you about problems it's having communicating with
>the partner.

I would recommend this.  Like I said above, netboot has to run on bare hardware
in a very tight memory space, so it doesn't handle errors too gracefully.  The
netman program, however, has the full power of the OS underneath it, so it can
be a little more verbose about what it's doing.

-steve

| Steve Vinoski  (508)256-0176 x5904       | Internet: vinoski@apollo.hp.com  |
| HP Apollo Division, Chelmsford, MA 01824 | UUCP: ...!apollo!vinoski         |
| "The price of knowledge is learning how little of it you yourself harbor."  |
|                                                    - Tom Christiansen       |

hanche@imf.unit.no (Harald Hanche-Olsen) (03/01/91)

In article <5011bdff.20b6d@apollo.HP.COM> vinoski@apollo.HP.COM (Stephen Vinoski) writes:

   In article <50081142.1bc5b@pisa.ifs.umich.edu> rees@citi.umich.edu (Jim Rees) writes:
   >In article <HANCHE.91Feb25182545@hufsa.imf.unit.no>, hanche@imf.unit.no (Harald Hanche-Olsen) writes:
   >  Does anybody know anything about the protocol used when booting
   >  diskless from another node?
   >The net read routine in the PROM is very simple-minded.  It does no timeout/
   >retransmission.  That means that if you have trouble loading netboot, you
   >have to start over.  I don't remember whether netboot (the thing that says
   >".......loaded xxx bytes") does much better, but you used to be able to get
   >it to retry by hitting the space bar.

I tried the space bar and it doesn't help.

   >You can also read the stdout from netman (by running it in a pad, for
   >example) and it may tell you about problems it's having communicating with
   >the partner.

   I would recommend this.  Like I said above, netboot has to run on bare hardware
   in a very tight memory space, so it doesn't handle errors too gracefully.  The
   netman program, however, has the full power of the OS underneath it, so it can
   be a little more verbose about what it's doing.

I tried this, with the following result:

# /sys/net/netman
NETMAN -- User level network server -- 1990/05/17

 ----- Message received from node 9952  on  2/26/1991 at  5:03:54 PM. 
--- Sysboot Request from Node 9952
Rqst for file: "netboot".

 ----- Message received from node 9952  on  2/26/1991 at  5:03:55 PM. 
--- Load Range Request ---  DOMAIN_OS (0:15)
Rqst for file: "//mummi/sau8/DOMAIN_OS".

 ----- Message received from node 9952  on  2/26/1991 at  5:03:55 PM. 
--- Load Range Request ---  DOMAIN_OS (16:31)
Rqst for file: "//mummi/sau8/DOMAIN_OS".

( intervening messages excruciatingly boring, hence omitted )

 ----- Message received from node 9952  on  2/26/1991 at  5:04:03 PM. 
--- Load Range Request ---  DOMAIN_OS (160:175)
Rqst for file: "//mummi/sau8/DOMAIN_OS".

==== And here it stopped.
Meanwhile, on the screen of node 9952, something like

...  00004000 BYTES LOADED.
...  00008000, ... C000, 10000, 14000, 18000, 1C000, 20000, 24000,
...  00028000 BYTES LOADED.
(Stuck at beginning of next line)

At this point I tried Jim's suggestion, and hit the space bar.  No
response on either node.  I did some calculation on the above numbers:
Netman reports the load range in kilobytes, in decimal.  After the
next-to-last request (144:159) has been honored, we have loaded 160KB
or 16#28000 bytes, as last reported on 9952's screen.  Then more
pages are requested, but not one of them is received.
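
The calculation above can be checked with a few lines of Python.  This is
only a sketch: it assumes, as inferred above, that a range (first:last) names
kilobytes counted from zero with both ends included.  The helper name
bytes_after is made up for the example.

```python
import re

# Matches the "(first:last)" part of a netman "Load Range Request" line.
RANGE_RE = re.compile(r"\((\d+):(\d+)\)")


def bytes_after(line: str):
    """Bytes loaded once the range on this line has been honored, or None."""
    match = RANGE_RE.search(line)
    if match is None:
        return None
    last_kb = int(match.group(2))
    return (last_kb + 1) * 1024   # ranges are inclusive and start at 0


log = [
    "--- Load Range Request ---  DOMAIN_OS (0:15)",
    "--- Load Range Request ---  DOMAIN_OS (144:159)",
    "--- Load Range Request ---  DOMAIN_OS (160:175)",
]

for line in log:
    print(line, "->", format(bytes_after(line), "08X"))
```

The (144:159) line gives 00028000, the last count shown on the screen of
node 9952, so the request for (160:175) is the one that was never answered.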

Well, thanks for the help.  I guess it's time to give the hardware a
good shakedown...

- Harald Hanche-Olsen <hanche@imf.unit.no>
  Division of Mathematical Sciences
  The Norwegian Institute of Technology
  N-7034 Trondheim, NORWAY

krowitz@RICHTER.MIT.EDU (David Krowitz) (03/01/91)

You know, your problem is beginning to look familiar ...

We have several diskless machines here at MIT which are booted off
of a variety of nodes. A little while ago, our AA machine had its
disk drive go up in smoke. Suddenly, none of the diskless machines
would boot off of their partners *except* for the machines whose
partners had a full OS load (i.e. *NONE* of SR10.2 on the partner
machine was installed as a link to the AA machine). I searched and
searched for what could conceivably be the missing file and didn't
find anything ... yet, when I re-installed the OS to point to the
new AA machine, the diskless nodes could boot again.


 -- David Krowitz

krowitz@richter.mit.edu   (18.83.0.109)
krowitz%richter.mit.edu@eddie.mit.edu
krowitz%richter.mit.edu@mitvma.bitnet
(in order of decreasing preference)