[comp.unix.questions] uucico fails on call-in

bert@mebazf.UUCP (03/23/87)

Hi there.

We have a problem here with our uucp system that has got me 
stumped.  We are running 4.3 BSD on a VAX 750.  The problem 
that we have is that incoming uucp calls seem to get fouled 
up unless we have local debug enabled.  The problem seems to 
be that the 'g' protocol gets out of sync between the two 
machines and we start getting 

    BAD READ (expected 'C' [or S or whatever] got FAIL (2))

messages in our LOGFILE.  This problem does not occur when 
we poll other systems, nor does it occur when our uucp login 
executes uucico with local debug on.  It also seems to depend 
on the 'strength' of the debugging level.  For instance, 
'uucico -x1' doesn't solve the problem, but '-x9' makes things 
run just dandy.

Other details:

    - The problem occurs independent of whether the other system 
      is calling over a modem or is hard-wired with a direct 
      connection to our machine.

    - We didn't have any problems under 4.2 BSD.

    - The failure tends to happen early in the connection, 
      normally during the transfer of the first file from us to them.

Any solutions or suggestions would be greatly appreciated.


	  Thanks.

	    Bert Kay
	    Mettler Instrumente AG, CH-8606, Greifensee, Switzerland
	    ...ucbvax!mebazf!bert
	    ...mcvax!cernvax!unizh!mebazf!bert

Robert_Toxen@harvard.harvard.EDU (03/30/87)

> From: Herbert Kay <bert@mebazf.uucp>
> Subject: uucico fails on call-in
> Date: 22 Mar 87 23:59:59 GMT
> To:       info-unix@brl-sem.arpa
> 
> We have a problem here with our uucp system that has got me 
> stumped.  We are running 4.3 BSD on a VAX 750.  The problem 
> that we have is that incoming uucp calls seem to get fouled 
> up unless we have local debug enabled.  The problem seems to 
> be that the 'g' protocol gets out of sync between the two 
> machines and we start getting 
> 
>     BAD READ (expected 'C' [or S or whatever] got FAIL (2))
> 
> messages in our LOGFILE.  This problem does not occur when 
> we poll other systems, nor does it occur when our uucp login 
> executes uucico with local debug on.  It also seems to depend 
> on the 'strength' of the debugging level.  For instance, 
> 'uucico -x1' doesn't solve the problem, but '-x9' makes things 
> run just dandy.
> 
> Other details:
> 
>     - The problem occurs independent of whether the other system 
>       is calling over a modem or is hard-wired with a direct 
>       connection to our machine.
> 
>     - We didn't have any problems under 4.2 BSD.
> 
>     - The failure tends to happen early in the connection, 
>       normally during the transfer of the first file from us to them.
> 
> 	    Mettler Instrumente AG, CH-8606, Greifensee, Switzerland
> 	    ...ucbvax!mebazf!bert
> 	    ...mcvax!cernvax!unizh!mebazf!bert

The "BAD READ (expected 'C' [or S or  whatever]  got  FAIL  (2))"
means  that  the  slave failed to respond to the master's request
within the allowed time, generally 10 seconds for commands and 20
seconds  for data.  It typically happens when the slave system is
very busy.

Also, it can happen after the calling system (initially master)
is done transferring its files (if any) and asks the slave
system whether it has any requests.  This appears to be what is
happening on your system.  The called system (which has just
become the master) must look in its spool directory for requests
and accumulate the first LLEN of them (LLEN is 50 and is defined
in anlwrk.c).  If it finds any, it replies to the other system
that it has a request; if there aren't any, it says all done.
Obviously, the more stuff accumulated (up to 50 requests), the
longer this operation takes.  Also, if the last connection
failed in the middle of a request, it must decide where in the
middle of the request to pick up.  (This is what the A.* files
are.)  Thus the connection fails and more stuff can accumulate,
making failure more likely the next time.

It doesn't matter whether it is a direct connect or  modem.   The
UNIX version DOES matter because different versions (even the
same type from different vendors) have different timeouts.
Turning  on  debugging  on  the  calling system helps the problem
because it is writing debugging messages while the called  system
does  some  of  its  processing,  effectively increasing the time
allowed for the slave to respond.  (If there is noise on the line
causing  data  to  be  lost  then all bets are off; uucico cannot
always recover when there is a lot of noise when the two  systems
are trying to swap roles.)

Unfortunately, this is not an easy problem to solve.  Some
suggestions are to have the calling system call at times when the
called system is not too busy, modify the calling system's uucico
for longer timeouts, or poll more often (to reduce the average
number of files in the spool directory).

You might move uucico to uucico2 and create a small "uucico" that
is set-UID to root, boosts its priority, does a setuid() to
uucp (or uucpadm as the case may be), and then execs the real
uucico2.  Be careful about security bugs if you go this route.
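
A minimal sketch of what such a wrapper might do, written against
modern POSIX calls rather than the 4.3 BSD sources.  The function
name exec_as_user() is hypothetical, and the order of steps is just
one reasonable choice: look up the account first, boost priority
while still root, drop privileges, then exec.

```c
#define _XOPEN_SOURCE 700
#include <pwd.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: set the nice value, switch to `user`, and exec
 * `path`.  It returns only on failure, always with -1.
 */
int exec_as_user(const char *user, const char *path,
                 char *const argv[], int nice_value)
{
    /* Look the account up first so a bad name changes nothing. */
    struct passwd *pw = getpwnam(user);
    if (pw == NULL)
        return -1;

    /* A negative nice value needs root; failure here is not fatal. */
    if (setpriority(PRIO_PROCESS, 0, nice_value) < 0)
        perror("setpriority");

    /* Drop the group first: after setuid() we could no longer do it. */
    if (setgid(pw->pw_gid) < 0 || setuid(pw->pw_uid) < 0)
        return -1;

    execv(path, argv);
    return -1;              /* execv() returns only on error */
}
```

A main() for the wrapper would then be little more than
exec_as_user("uucp", "/usr/lib/uucp/uucico2", argv, -5), where both
the path and the nice value are assumptions to adjust locally.  A
production wrapper would also want to clear supplementary groups and
clean up the environment, which this sketch leaves out.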

You might, on the called system, change this "number of files  to
accumulate"  constant  to  be  smaller and then recompile uucico.
Look for "#define LLEN 50" in anlwrk.c.  You might change the
calling system to have a larger timeout by editing pkcget() in
pk1.c: search for "alarm(" and boost the constant.  There
are those on the net who claim that this may cause problems, but
we haven't observed any.
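
For readers who have not seen the idiom, here is a rough sketch of
an alarm()-guarded read.  It is NOT the code from pk1.c; the name
timed_read() and its return convention are made up for illustration,
and it uses sigaction() where the original sources may well differ.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Empty handler: its only job is to interrupt the blocked read(). */
static void on_alarm(int sig) { (void)sig; }

/*
 * Read up to `len` bytes from `fd`, giving up after `seconds`.
 * Returns bytes read, 0 on EOF, -1 on error, -2 on timeout.
 * Note: if the alarm fires before read() is entered, the read is not
 * interrupted; with timeouts of a second or more this is unlikely.
 */
long timed_read(int fd, char *buf, size_t len, unsigned seconds)
{
    struct sigaction sa, old;
    long n;
    int saved;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;            /* no SA_RESTART: read() fails with EINTR */
    if (sigaction(SIGALRM, &sa, &old) < 0)
        return -1;

    alarm(seconds);             /* the constant one would "boost" */
    n = read(fd, buf, len);
    saved = errno;
    alarm(0);                   /* cancel any pending alarm */
    sigaction(SIGALRM, &old, NULL);

    if (n < 0 && saved == EINTR)
        return -2;
    errno = saved;
    return n;
}
```

Raising `seconds` gives a slow peer more time before the caller
declares the read failed, at the cost of waiting longer on a line
that really is dead.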

Bob Toxen
Stratus Computer, Marlboro, Mass.
{ucbvax!ihnp4,harvard}!anvil!bob        (Please use THIS address to reply)
"Just say NO to drug TESTS; say YES to the Constitution."

rick@seismo.UUCP (03/30/87)

> You might, on the called system, change this "number of files  to
> accumulate"  constant  to  be  smaller and then recompile uucico.
> Look for "#define LLEN 50" in anlwrk.c.


Note that this will probably NOT work with 4.3 BSD or 4.2 BSD uucicos.
There is a really stupid bug in both the 4.3 and 4.2 anlwrk routines
that makes it ALWAYS read the entire directory even if it has already
found LLEN entries.  (When the code was converted to use the "new" readdir()
routines, it wasn't quite done right.)
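
To illustrate the intended behaviour (this is a sketch, not the
posted patch and not the anlwrk.c code), a readdir() loop can test
the count before each call so that it stops reading as soon as the
list is full.  The name collect_work() and the NAMESIZE value are
assumptions made for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stdio.h>
#include <string.h>

#define LLEN     50   /* the value quoted above from anlwrk.c */
#define NAMESIZE 64   /* hypothetical buffer size for this sketch */

/*
 * Collect up to `max` entries in `dir` whose names start with `prefix`.
 * Returns the number collected, or -1 if the directory cannot be opened.
 */
int collect_work(const char *dir, const char *prefix,
                 char names[][NAMESIZE], int max)
{
    DIR *dp = opendir(dir);
    struct dirent *ent;
    size_t plen = strlen(prefix);
    int n = 0;

    if (dp == NULL)
        return -1;

    /* Check the count first: no readdir() is issued once the list is full. */
    while (n < max && (ent = readdir(dp)) != NULL) {
        if (strncmp(ent->d_name, prefix, plen) != 0)
            continue;
        snprintf(names[n], NAMESIZE, "%s", ent->d_name);
        n++;
    }
    closedir(dp);
    return n;
}
```

A call might look like collect_work("/usr/spool/uucp", "C.", names,
LLEN), with the spool path being an assumption about the local
layout.  With thousands of queued jobs, the early exit is the whole
point: the scan costs roughly LLEN matches rather than the full
directory.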

I have already posted patches to fix this problem (It really shows
up when you have 6 or 7 thousand jobs queued up...). I can mail them
if you didn't save them.

--rick