wood@cascade (04/17/86)
From: Ernest Wood <wood@cascade>

FYI: Perhaps Brian and the experts out there are aware of this, but I thought I'd post it anyway.

The problem with news from Glacier right now is that the batch method on Glacier proceeds host by host, with each host getting all of its batch files before the next host in line gets any. There is a bug somewhere in rcp (the method used to transfer the files) such that, in some cases, it doesn't time out but also won't complete the transfer to a particular host. If it did time out, the shell script performing the transfers would proceed to the next host and ignore the remaining files destined for the apparently down machine. Since it neither times out nor completes, the script remains hung until the next hourly invocation of the shell script. At that time the hung script (but NOT the rcp) is killed and a new one is started. This one transfers everything it can until it too hangs on the funny host. And so it goes.

Unfortunately navajo appears to be such a funny host and is also in about the middle of the list. The script sorts the list alphabetically; otherwise it would be possible to remove part of the problem by putting navajo last. Of course the other side of the problem is that glacier slowly accumulates a number of supposedly dead rcp's to navajo.

For some time I've noticed that this was happening to navajo, but I always assumed it was coincidence. Apparently it isn't. WHY this is happening is another question altogether. I just find um, I don't explain um,

-ernie
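
For anyone who wants to see the intended behavior spelled out, here is a rough sketch in Python of a host-by-host loop where the timeout is enforced by the caller rather than left to rcp. It is not the Glacier script, which is not shown in this post. The names `send_batches` and `rcp_cmd`, the 300-second limit, and the remote destination are all assumptions made for illustration.

```python
import subprocess

def send_batches(batches, build_cmd, timeout_s=300):
    """Send each host's batch files in turn (hypothetical helper).

    batches maps host name -> list of batch file paths.
    build_cmd(path, host) returns the argv list for one transfer.
    Returns {host: number of files sent successfully}.
    """
    sent = {}
    for host in sorted(batches):  # alphabetical order, as described in the post
        sent[host] = 0
        for path in batches[host]:
            try:
                # subprocess.run kills the child when the timeout expires,
                # so a stuck transfer is not left behind.
                result = subprocess.run(build_cmd(path, host), timeout=timeout_s)
            except subprocess.TimeoutExpired:
                break  # treat the host as down and skip its remaining files
            if result.returncode != 0:
                break  # a failed transfer also ends this host's turn
            sent[host] += 1
    return sent

def rcp_cmd(path, host):
    # Assumed destination: the same path on the remote host.
    return ["rcp", path, f"{host}:{path}"]
```

Calling `send_batches(batches, rcp_cmd)` would then move on past a host like navajo after one timed-out transfer instead of hanging until the next hourly run. Whether killing a stuck rcp actually frees it depends on where it is stuck, which is the part still unexplained.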