bert@mebazf.UUCP (03/23/87)
Hi there.

We have a problem here with our uucp system that has got me stumped. We are running 4.3 BSD on a VAX 750. Incoming uucp calls seem to get fouled up unless we have local debug enabled. The 'g' protocol appears to get out of sync between the two machines, and we start getting

    BAD READ (expected 'C' [or S or whatever] got FAIL (2))

messages in our LOGFILE. This problem does not occur when we poll other systems, nor does it occur when our uucp login executes uucico with local debug on. It also seems to depend on the 'strength' of the debugging level. For instance, 'uucico -x1' doesn't solve the problem, but '-x9' makes things run just dandy.

Other details:

- The problem occurs regardless of whether the other system is calling over a modem or is hard-wired with a direct connection to our machine.
- We didn't have any problems under 4.2 BSD.
- The failure tends to happen early in the connection, normally during the transfer of the first file from us to them.

Any solutions or suggestions would be greatly appreciated.

Thanks.

Bert Kay
Mettler Instrumente AG, CH-8606, Greifensee, Switzerland
...ucbvax!mebazf!bert
...mcvax!cernvax!unizh!mebazf!bert
Robert_Toxen@harvard.harvard.EDU (03/30/87)
> From: Herbert Kay <bert@mebazf.uucp>
> Subject: uucico fails on call-in
> Date: 22 Mar 87 23:59:59 GMT
> To: info-unix@brl-sem.arpa
>
> We have a problem here with our uucp system that has got me
> stumped. We are running 4.3 BSD on a VAX 750. The problem
> that we have is that incoming uucp calls seem to get fouled
> up unless we have local debug enabled. The problem seems to
> be that the 'g' protocol gets out of sync between the two
> machines and we start getting
>
> BAD READ (expected 'C' [or S or whatever] got FAIL (2))
>
> messages in our LOGFILE. This problem does not occur when
> we poll other systems, nor does it occur when our uucp login
> executes uucico with local debug on. It also seems to depend
> on the 'strength' of the debugging level. For instance,
> 'uucico -x1' doesn't solve the problem, but '-x9' makes things
> run just dandy.
>
> - The problem occurs independent of whether the other system
> is calling over a modem or is hard-wired with a direct
> connection to our machine.
>
> - We didn't have any problems under 4.2 BSD.
>
> - The failure tends to happen early in the connection,
> normally during the transfer of the first file from us to them.
>
> Mettler Instrumente AG, CH-8606, Greifensee, Switzerland
> ...ucbvax!mebazf!bert
> ...mcvax!cernvax!unizh!mebazf!bert

The "BAD READ (expected 'C' [or S or whatever] got FAIL (2))" message means that the slave failed to respond to the master's request within the allowed time, generally 10 seconds for commands and 20 seconds for data. It typically happens when the slave system is very busy.

It can also happen after the calling system (initially the master) is done transferring its files (if any) and asks the slave system whether it has any requests. This appears to be what is happening on your system. The called system (which has just become the master) must look in its spool directory for requests and accumulate the first LLEN of them (LLEN is 50 and is defined in anlwrk.c). If it finds any, it replies to the other system that it has a request.
If there aren't any, it says it is all done. Obviously, the more stuff accumulated (up to 50 requests), the longer this operation takes. Also, if the last connection failed in the middle of a request, it must decide where in that request to pick up. (This is what the A.* files are.) Thus the connection fails and more stuff can accumulate, making failure more likely the next time.

It doesn't matter whether it is a direct connection or a modem. The UNIX version DOES matter because different versions (even the same type from different vendors) have different timeouts.

Turning on debugging on the calling system helps the problem because it is writing debugging messages while the called system does some of its processing, effectively increasing the time allowed for the slave to respond. (If there is noise on the line causing data to be lost then all bets are off; uucico cannot always recover when there is a lot of noise while the two systems are trying to swap roles.)

Unfortunately, this is not an easy problem to solve. Some suggestions are to have the calling system call at times when the called system is not too busy, modify the calling system's uucico for longer timeouts, or poll more often (to reduce the average number of files in the spool directory).

You might move uucico to uucico2 and create a small "uucico" that is set-UID to root, boosts its priority, does a setuid() to uucp (or uucpadm, as the case may be), and then execs the real uucico2. Be careful about security bugs if you go this route.

You might, on the called system, change this "number of files to accumulate" constant to be smaller and then recompile uucico. Look for "#define LLEN 50" in anlwrk.c.

You might change the calling system to have a larger timeout by editing pkcget() in pk1.c, searching for "alarm(", and boosting the constant. There are those on the net who claim that this may cause problems, but we haven't observed any.

Bob Toxen
Stratus Computer, Marlboro, Mass.
{ucbvax!ihnp4,harvard}!anvil!bob (Please use THIS address to reply)
"Just say NO to drug TESTS; say YES to the Constitution."
rick@seismo.UUCP (03/30/87)
> You might, on the called system, change this "number of files to
> accumulate" constant to be smaller and then recompile uucico.
> Look for "#define LLEN 50" in anlwrk.c.

Note that this will probably NOT work with 4.3 BSD or 4.2 BSD uucicos. There is a really stupid bug in both the 4.3 and 4.2 anlwrk routines that makes it ALWAYS read the entire directory even if it has already found LLEN entries. (When the code was converted to use the "new" readdir() routines, it wasn't quite done right.)

I have already posted patches to fix this problem. (It really shows up when you have 6 or 7 thousand jobs queued up...) I can mail them if you didn't save them.

--rick
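The behaviour described above, stopping the directory scan once LLEN entries have been found, might look roughly like this in C. This is a sketch only: it is neither the posted patch nor the actual anlwrk.c code, and collect_work, NAMELEN, and the prefix argument are hypothetical names chosen for illustration. LLEN uses the value quoted in the thread.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

#define LLEN    50    /* same value the thread quotes from anlwrk.c */
#define NAMELEN 256   /* assumed buffer size for this sketch */

/*
 * Collect up to `max` file names in `dir` that start with `prefix`.
 * Returns the number of names stored, or -1 if the directory cannot
 * be opened.
 */
int collect_work(const char *dir, const char *prefix,
                 char names[][NAMELEN], int max)
{
    DIR *dp;
    struct dirent *ent;
    size_t plen = strlen(prefix);
    int n = 0;

    dp = opendir(dir);
    if (dp == NULL)
        return -1;

    /* Testing n < max first means readdir() is not called again
       once the list is full, so a huge spool directory is not
       scanned to the end. */
    while (n < max && (ent = readdir(dp)) != NULL) {
        if (strncmp(ent->d_name, prefix, plen) != 0)
            continue;
        snprintf(names[n], NAMELEN, "%s", ent->d_name);
        n++;
    }
    closedir(dp);
    return n;
}
```

With the loop written this way, lowering the limit passed as max actually shortens the scan, which is the effect the LLEN suggestion depends on.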