hdtodd@eagle.wesleyan.edu (04/12/91)
This is a followup with the resolution of a problem I posed a couple of weeks ago involving diskless 2500's that will not boot reliably over Ethernet. I'm posting this for general information in the Apollo community so that others who run into the problem will have a hint as to where to look. I realize the configuration we're running is pretty non-standard, but the results might be useful to others anyway.

Situation

We have two diskless 2500's in a student lab; the disked 2500 is in a faculty office. We use Ethernet for campus networking, and the faculty offices and student labs are separated by a bridge for security reasons. We use PCBRIDGE, a freeware program written by Vance Morrison at Northwestern, running on an AT&T PC6300 with two WD8003 boards (6.5KB buffers) as Ethernet interfaces. There is other traffic through the bridge (notably, faculty offices accessing the Novell server in the student lab and student TCP terminal traffic to central systems).

Problem

In a standalone configuration with just the three machines on a thinwire segment, the two diskless nodes would boot -- even concurrently -- with no problems, and performance was quite respectable. In the bridged-Ethernet configuration, the diskless 2500's would not boot reliably. They would frequently hang during either the load of the diagnostic code or the load of the OS. If they got past those two loads, the startup of program code (uwm, for example) was terribly slow, marked by long periods of EN inactivity (judging by the console lights). Booting was most often successful at odd hours when there was little activity on the net.

Debugging

We tried a variety of debugging techniques: turning off one diskless system while booting the other (no strong correlation with successful boots), moving one 2500 to the un-bridged side of the net (completely successful boots), and replacing the AT&T 6300 with a 16MHz 286-based NCR PC (no improvement in the rate of successful boots). We also monitored the net with a Sniffer.
We found that successful boots were marked by long sequences of 1130-byte packets from the server following a brief request packet from the booting node. Unsuccessful boots had long sequences of 1130-byte packets that terminated abruptly, apparently before the sequence completed, and no dialog from the booting node followed (i.e., no attempt to ask for a retry). Finally, we set up a two-node net with the Sniffer and monitored the traffic with no bridge, on the diskless-node side of the bridged network, and on the server side of the bridged network. The server puts out one 1130-byte packet approximately every 2.5 msec when supplying the boot code. With the bridge in place, the server still produces a packet every 2.5 msec, but the bridge passes them to the booting node about 3.3-3.6 msec apart. We speculate that after some period of time the bridge's buffer fills and it drops packets; the booting node is unable to recover from this and hangs.

Analysis

It appears that the problem was caused by two factors: the boot-ROM code that handles the Apollo primary boot protocol is not robust (dropped packets cause a hang), and the bridge we're using does not pass packets as fast as the server generates them, so packets do get dropped. Since the Apollo code was likely developed with Token Ring in mind as the medium, the non-robust code is probably not a surprise -- and not a problem for most Apollo users. It is interesting that using a faster PC for the bridge, while keeping the 8-bit WD cards as Ethernet interfaces, did not solve the problem. We do not yet know whether EN boards with larger buffers, or 16-bit boards in the bridge, would solve it. The AT&T configuration we're using has been analyzed by Morrison as handling 3000 60-byte packets per second (0.33 msec interpacket spacing); the 16MHz AT configuration handles 6000 60-byte packets per second.
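For the curious, the buffer-overrun theory is easy to check with a little queueing arithmetic: packets arrive every 2.5 msec but leave every ~3.4 msec, and a 6.5KB buffer holds only five 1130-byte packets. The sketch below models the bridge as a single FIFO with that capacity; the one-queue model and the 3.4 msec forwarding figure are our assumptions from the Sniffer traces, not anything measured inside PCBRIDGE itself.

```python
# Rough simulation of the bridge buffer during a boot load.  Packet size,
# arrival spacing, and buffer size come from our traces and the WD8003
# specs; the single-FIFO model is an assumption.

PKT_BYTES   = 1130        # boot-load packet size seen on the wire
IN_SPACING  = 2.5         # msec between packets from the server
OUT_SPACING = 3.4         # msec between packets forwarded (3.3-3.6 observed)
BUF_BYTES   = 6.5 * 1024  # WD8003 on-board buffer

buf_capacity = int(BUF_BYTES // PKT_BYTES)   # whole packets the buffer holds

t_in, t_out = 0.0, OUT_SPACING
queued = dropped = sent = 0
events = []
while sent < 200 and dropped == 0:
    if t_in <= t_out:              # next event: packet arrives from server
        if queued < buf_capacity:
            queued += 1
        else:
            dropped += 1
            events.append((t_in, "first drop"))
        t_in += IN_SPACING
    else:                          # next event: bridge forwards a packet
        if queued:
            queued -= 1
            sent += 1
        t_out += OUT_SPACING

print(f"buffer holds {buf_capacity} packets")
print(f"first drop after ~{events[0][0]:.0f} msec, {sent} packets forwarded")
# → first drop after ~40 msec, 11 packets forwarded
```

On these numbers the bridge falls over within a few dozen packets, which matches the abruptly terminated sequences we saw on the Sniffer.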
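To make concrete what "not robust" means above: a loader that tolerated drops would time out on a stalled stream and re-request the missing block rather than hang. The sketch below is purely illustrative -- the block layout, function names, and TFTP-style short-final-block convention are all inventions of ours, not the actual Apollo primary boot protocol.

```python
# Hypothetical timeout-and-retry receive loop of the kind the boot ROM
# appears to lack.  Protocol details here are assumed for illustration.
import socket

BLOCK_SIZE    = 1024  # payload bytes per boot packet (assumed)
BLOCK_TIMEOUT = 1.0   # seconds to wait for the next packet (assumed)
MAX_RETRIES   = 5     # re-requests before giving up (assumed)

def receive_boot_image(sock, request_block):
    """Collect boot blocks in order; on a gap, re-request instead of hanging."""
    image = []
    next_block = 0
    retries = 0
    sock.settimeout(BLOCK_TIMEOUT)
    while retries < MAX_RETRIES:
        try:
            data, _ = sock.recvfrom(4 + BLOCK_SIZE)
        except socket.timeout:
            # A dropped packet stalls the stream; rather than hang the
            # way the boot ROM does, ask the server to resume at the gap.
            request_block(next_block)
            retries += 1
            continue
        blkno, payload = int.from_bytes(data[:4], "big"), data[4:]
        if blkno != next_block:
            continue                   # duplicate or out-of-order: ignore
        image.append(payload)
        next_block += 1
        retries = 0
        if len(payload) < BLOCK_SIZE:  # short block ends the image (assumed)
            return b"".join(image)
    raise TimeoutError("boot load failed after %d retries" % MAX_RETRIES)
```

With logic like this, the bridge's dropped packets would cost a retransmission delay instead of a hung node.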
We might have speculated that the 16MHz AT configuration, at least, should be able to keep up with the 2.5 msec interpacket spacing from the server -- but it doesn't. A commercial bridge MIGHT handle this rate, but we haven't yet determined that, and we do not yet have a working bridge system for this net. If readers who have made it this far have other possible solutions to the problem, I'd love to hear about them! Responses to hdtodd@mockingbird.wesleyan.edu would be very welcome. Hope this helps others.

David Todd
Wesleyan University