[comp.sys.sun] Network problem: Sun-2 not recognizing ARP replies, ICMP Echo Req.

sthaug@idt.unit.no (Steinar Haug) (10/20/89)

[I have posted this to sun-managers and sun-nets earlier, without getting
any good suggestions. I'm hoping that the sun-spots readers can help me!]

We are having problems with a Sun-2 running 3.5 here, and I wonder if the
following description sounds familiar to any of you. The machine with the
problems is named sizex.

The machine's ypbind process suddenly discovered that its default domain
was unbound, but refused to rebind. I have two YP servers, and I can
see with tcpdump (or etherfind) that sizex is sending queries to the
servers, and they are answering, but evidently sizex refuses to recognize
the answers.

I brought the machine down to single-user mode and started digging deeper into
the problem. I found some interesting information:

1. Using ping from sizex to another machine I can see that sizex is
actually sending out ARP packets, and is getting replies. However, it does
not recognize the replies, and does not send any ICMP Echo Requests. After
a while ping times out, and the entry in the ARP cache is marked
(incomplete). But if instead I ping sizex from another machine, sizex
*will* enter the Ethernet address for this other machine in its ARP cache!
It will *not* answer the ICMP Echo Request from the other machine,
however.

2. If sizex *has* an entry for a machine in its ARP cache (obtained as
above) it *will* send out ICMP Echo Requests, and receive answers to them
from the other machine. But it still doesn't recognize the answers, and
ping times out after a while.

3. If sizex *has* an entry for a machine in its ARP cache and I try
telnetting to another machine, I get the following from tcpdump:

Script started on Tue Oct 17 00:05:52 1989
boheme# tcpdump ehost sizex

00:06:15.26  sizex.1025 > dorma.telnet: S 5124865:5124865(0) win 4096
 <mss 1024>
00:06:15.26  dorma.telnet > sizex.1025: S 49486209:49486209(0) ack 5124866
 win 4096
00:06:20.98  sizex.1025 > dorma.telnet: S 5124865:5124865(0) win 4096
 <mss 1024>
00:06:20.98  dorma.telnet > sizex.1025: . ack 1 win 4096
00:06:21.10  dorma.telnet > sizex.1025: S 49486209:49486209(0) ack 5124866
 win 4096
00:06:26.99  sizex.1025 > dorma.telnet: S 5124865:5124865(0) win 4096
 <mss 1024>
00:06:26.99  dorma.telnet > sizex.1025: . ack 1 win 4096
00:06:27.11  dorma.telnet > sizex.1025: S 49486209:49486209(0) ack 5124866
 win 4096
00:06:38.99  sizex.1025 > dorma.telnet: S 5124865:5124865(0) win 4096
 <mss 1024>
00:06:38.99  dorma.telnet > sizex.1025: . ack 1 win 4096
00:06:39.11  dorma.telnet > sizex.1025: S 49486209:49486209(0) ack 5124866
 win 4096
00:07:03.09  dorma.telnet > sizex.1025: S 49486209:49486209(0) ack 5124866
 win 4096
(connection timed out)

Again, it seems to me that sizex refuses to recognize the answer from the
other machine; it just keeps resending its initial SYN segment.

4. I have tried the above both with the correct subnet mask (0xffffff00)
given to ifconfig and without a subnet mask. The end result was exactly the
same. I also tried rebuilding the kernel. No difference...
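
The reading of the trace in item 3 can be checked mechanically. The following
is only a rough sketch in Python, not part of the original session: the names
TRACE, parse and summarize are made up for illustration, and TRACE is the
tcpdump output from item 3 with the wrapped continuation lines joined back
onto their packets. It counts the SYNs sent by sizex, the SYN+ACKs sent by
dorma, and any packet from sizex that carries an ack.

```python
import re

# Packets copied from the tcpdump output in item 3 (wrapped lines rejoined).
TRACE = """\
00:06:15.26 sizex.1025 > dorma.telnet: S 5124865:5124865(0) win 4096 <mss 1024>
00:06:15.26 dorma.telnet > sizex.1025: S 49486209:49486209(0) ack 5124866 win 4096
00:06:20.98 sizex.1025 > dorma.telnet: S 5124865:5124865(0) win 4096 <mss 1024>
00:06:20.98 dorma.telnet > sizex.1025: . ack 1 win 4096
00:06:21.10 dorma.telnet > sizex.1025: S 49486209:49486209(0) ack 5124866 win 4096
00:06:26.99 sizex.1025 > dorma.telnet: S 5124865:5124865(0) win 4096 <mss 1024>
00:06:26.99 dorma.telnet > sizex.1025: . ack 1 win 4096
00:06:27.11 dorma.telnet > sizex.1025: S 49486209:49486209(0) ack 5124866 win 4096
00:06:38.99 sizex.1025 > dorma.telnet: S 5124865:5124865(0) win 4096 <mss 1024>
00:06:38.99 dorma.telnet > sizex.1025: . ack 1 win 4096
00:06:39.11 dorma.telnet > sizex.1025: S 49486209:49486209(0) ack 5124866 win 4096
00:07:03.09 dorma.telnet > sizex.1025: S 49486209:49486209(0) ack 5124866 win 4096
"""

# hh:mm:ss.cc  src > dst: flags rest-of-line
PACKET = re.compile(r"^(\d\d):(\d\d):(\d\d)\.(\d\d)\s+(\S+) > (\S+): (\S+) (.*)$")


def parse(trace):
    """Turn each tcpdump line into a small dict; skip lines that do not match."""
    packets = []
    for line in trace.splitlines():
        m = PACKET.match(line.strip())
        if not m:
            continue
        h, mi, s, cs = (int(x) for x in m.group(1, 2, 3, 4))
        packets.append({
            # Time in hundredths of a second, kept as an integer.
            "t_cs": ((h * 60 + mi) * 60 + s) * 100 + cs,
            "src": m.group(5),
            "dst": m.group(6),
            "syn": "S" in m.group(7),
            "ack": re.search(r"\back \d+", m.group(8)) is not None,
        })
    return packets


def summarize(packets, client="sizex"):
    """Count handshake packets per side and the gaps between client SYNs."""
    from_client = [p for p in packets if p["src"].startswith(client + ".")]
    from_server = [p for p in packets if not p["src"].startswith(client + ".")]
    syn_times = [p["t_cs"] for p in from_client if p["syn"] and not p["ack"]]
    return {
        "client_syns": len(syn_times),
        "client_acks": sum(p["ack"] for p in from_client),
        "server_synacks": sum(p["syn"] and p["ack"] for p in from_server),
        "syn_gaps_s": [(b - a) / 100 for a, b in zip(syn_times, syn_times[1:])],
    }


print(summarize(parse(TRACE)))
```

On this trace the sketch reports four SYNs from sizex, five SYN+ACKs from
dorma, and no packet from sizex that carries an ack. The gaps between the
SYNs from sizex (5.72, 6.01 and 12.00 seconds) look like ordinary
retransmission back-off, which would be consistent with sizex transmitting
normally while never acting on anything it receives.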

I'm stuck. Any ideas out there? Thanks for any help!

Steinar Haug, System administrator
ELAB-RUNIT, University of Trondheim, Norway
Email: sthaug@idt.unit.no, steinar@flute.er.sintef.no