[comp.sys.apollo] Diskless node boot problems

hdtodd@eagle.wesleyan.edu (04/12/91)

	This is a followup with the resolution of a problem I posed a
couple of weeks ago involving diskless 2500's that would not boot
reliably over Ethernet.  I'm posting this for general information in
the Apollo community so that others who run into the problem will have
a hint as to where to look.  I realize the configuration we're running
is pretty non-standard, but the results might be useful to others
anyway.


Situation

	We have two diskless 2500's in a student lab.  The disked 2500 is
in a faculty office.  We use Ethernet for campus networking.  The faculty
offices and student labs are separated by a bridge for security reasons.
We use PCBRIDGE, a freeware program written by Vance Morrison at
Northwestern, running on an AT&T PC6300 with two WD8003 boards (6.5 KB
buffers) for Ethernet interfaces.  There is other traffic through the
bridge (notably, faculty offices accessing the Novell server in the
student lab and student TCP terminal traffic to central systems).


Problem

	In a standalone configuration with just the three machines on a
thinwire segment, the two diskless nodes would boot -- even concurrently
-- with no problems.  Performance was quite respectable.

	In the bridged-Ethernet configuration, the diskless 2500's would
not boot reliably.  They would frequently hang during either the load of
the diagnostic code or the load of the OS.  If they passed those two
loads, the startup of program code (uwm, for example) was terribly slow,
marked by long periods of EN inactivity (judging by the console lights).
Booting was most often successful at odd hours when there was little
activity on the net.


Debugging

	We tried a variety of debugging techniques: turning off one
diskless system while booting the other (no strong correlation with
successful boots), moving one 2500 to the un-bridged side of the net
(completely successful boots), replacing the AT&T 6300 with a 16MHz
286-based NCR PC (no improvement in rate of successful boots).

	We monitored the net with a Sniffer.  We found that the successful
boots were marked by long sequences of 1130-byte packets from the server
following a brief request packet from the booting node.  Unsuccessful
boots had long sequences of 1130-byte packets that terminated abruptly,
apparently before the sequence completed.  No dialog from the booting node
followed (i.e., no attempt to ask for a retry).

	Finally, we established a two-node net with the Sniffer and
monitored traffic in three configurations: with no bridge, on the
diskless-node side of the bridged network, and on the server side of
the bridged network.  The server puts out one
1130-byte packet approximately every 2.5 msec when supplying the boot
code.  With the bridge in place, the server still produces a packet every
2.5 msec, but the bridge passes them to the booting node about 3.3-3.6
msec apart.  We speculate that after some period of time, the bridge
buffer becomes filled and it drops packets.  The booting node is unable to
recover from this and hangs.
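
	As a rough back-of-the-envelope check on that speculation, the
small C program below estimates how quickly such a buffer would
overflow.  It is only a sketch: it assumes the bridge queues whole
packets in the WD8003's roughly 6.5 KB on-board buffer and forwards
them at the midpoint of the observed 3.3-3.6 msec spacing, neither of
which we have verified against the PCBRIDGE internals.

/* Back-of-the-envelope estimate of bridge buffer overflow.
 * Assumptions (not measured): whole 1130-byte packets are queued in
 * a ~6.5 KB buffer; packets arrive every 2.5 msec and are forwarded
 * every ~3.45 msec (midpoint of the observed 3.3-3.6 msec spacing).
 */
#include <stdio.h>

int main(void)
{
    double pkt_bytes = 1130.0;   /* boot-data packet size           */
    double buf_bytes = 6500.0;   /* usable WD8003 buffer (assumed)  */
    double t_in_ms   = 2.5;      /* server interpacket spacing      */
    double t_out_ms  = 3.45;     /* bridge forwarding spacing       */

    /* Net queue growth, in packets per millisecond. */
    double growth   = 1.0 / t_in_ms - 1.0 / t_out_ms;
    double buf_pkts = buf_bytes / pkt_bytes;   /* ~5.7 packets       */
    double t_fill   = buf_pkts / growth;       /* msec to overflow   */

    printf("buffer holds ~%.1f packets\n", buf_pkts);
    printf("queue grows ~%.3f packets/msec\n", growth);
    printf("overflow after ~%.0f msec (~%.0f packets arrived)\n",
           t_fill, t_fill / t_in_ms);
    return 0;
}

By that estimate the buffer overflows after only about 50 msec -- some
20 packets into the transfer -- which would be consistent with the
abrupt terminations we saw on the Sniffer.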


Analysis

	It appears that the problem was caused by two factors: the
boot-ROM code that handles the Apollo primary boot protocol is not
robust (dropped packets cause a hang), and the bridge we're using does
not pass packets as fast as the server generates them, so packets do
get dropped.  Since the Apollo code was likely developed with Token
Ring in mind as the medium, the non-robust code is probably not a
surprise -- and not a problem for most Apollo users.

	It is interesting that using a faster PC for the bridge, but
keeping the 8-bit WD cards for Ethernet interfaces, did not solve the
problem.  We do not yet know if EN boards with larger buffers or 16-bit
boards in the bridge would solve the problem.  The AT&T configuration
we're using has been analyzed by Morrison as handling 3000 60-byte packets
per second (0.33 msec interpacket spacing), and the 16 MHz AT configuration
handles 6000 60-byte packets per second.  We might have speculated that
the latter, at least, should be able to keep up with the 2.5 msec
interpacket spacing from the server -- but it doesn't.  A commercial
bridge MIGHT handle this rate, but we haven't yet determined that.
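
	One hedged way to reconcile Morrison's packets-per-second figures
with what we observed -- strictly a sketch, since it assumes the
bridge's per-packet cost is dominated by copying bytes through the
8-bit WD cards, which we have not measured -- is to convert everything
to bytes per second:

/* Sketch: compare quoted bridge throughput (60-byte packets) with the
 * boot stream, on the assumption that per-byte copy time dominates.
 */
#include <stdio.h>

int main(void)
{
    /* Morrison's figures for 60-byte packets. */
    double att_pps = 3000.0, at_pps = 6000.0, small_pkt = 60.0;

    /* Boot stream as observed on the Sniffer. */
    double boot_pkt = 1130.0, boot_spacing_s = 0.0025;

    double att_Bps  = att_pps * small_pkt;       /* ~180K bytes/sec */
    double at_Bps   = at_pps  * small_pkt;       /* ~360K bytes/sec */
    double boot_Bps = boot_pkt / boot_spacing_s; /* ~452K bytes/sec */

    printf("AT&T 6300 bridge: ~%.0f bytes/sec\n", att_Bps);
    printf("16 MHz AT bridge: ~%.0f bytes/sec\n", at_Bps);
    printf("boot stream:      ~%.0f bytes/sec\n", boot_Bps);
    return 0;
}

On that assumption even the 16 MHz AT tops out around 360K bytes/sec,
while the boot stream runs at roughly 450K bytes/sec, which would
explain why neither PC keeps up and why 16-bit or larger-buffered
boards might help.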

	We do not yet have a working bridged configuration for this net.
If readers who have made it this far have other possible solutions to
the problem, I'd love to hear about them!  Responses to
hdtodd@mockingbird.wesleyan.edu would be very welcome.

	Hope this helps others.  

						David Todd
						Wesleyan University