[comp.unix.aux] Major problems with A/UX serial drivers?

alexis@panix.uucp (Alexis Rosen) (06/26/91)

(Since some serial guru outside of the A/UX world may have relevant info,
I'm posting this to comp.unix.questions, but followups to comp.unix.aux only.)

I've been having major grief with A/UX serial port behavior for quite a
while now. Unlike previous problems (which remain unresolved), these
latest problems make the entire machine utterly unusable when they occur.

I may not have all of the relevant information I'd like, but the seriousness
of this matter is prompting me to post early. I will follow up with more info
as it becomes available. The machine is a Mac IIx with 8MB RAM, running
A/UX 2.0.1. It has 4 disks totalling about 1200MB and two CommCard 4-port
serial boards.

Ever since the upgrade to 2.0.1, or a little after, we have experienced
system crashes every few days. They'd be more frequent, but we reboot
when we see symptoms appear, if we're around when they become evident.
The crashes are all clist panics. The pre-crash symptoms are loss of
character input or output. For long periods, terminals will go dead, only to
behave normally for a second or two, and then go dead again. This behavior
is observable on the console, in the CommandShell windows, and on built-in
and add-on serial ports. The machine itself seems fine though- running
processes behave normally, logging no errors, the file system's OK, etc.
Of course uucico doesn't like things because it can't talk anymore.

We are pretty sure this has nothing to do with the Mac environment. We
rebooted a few days ago and never invoked the MacOS, even to run /etc/Login.
Yet we just had serious "character disease" occur again fifteen minutes ago.
(I managed to reboot, after trying to type "/etc/reboot" for about six
minutes.)

The big question is whether or not the add-on serial ports are causing
problems. But in a way, it's irrelevant. They worked fine in 2.0 (at least
in this respect), and if they're broken now then A/UX is broken. On the
other hand, it may be that the serial drivers for the built-in ports are
bad in 2.0.1. In that case, we'll know in a few days, because tonight we
are moving all modems off the built-in ports onto the cards. (For those who
might ask: the CommCard drivers make no use of anything past u-dot.)

As for why I suspect the serial ports- well, first of all, clists are used
by the serial drivers. And not much else. And the "character disease" seems
to be exactly what you'd see if you had major clist problems. (But I'm not
an expert here and am open to additional or contrary information.)
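
To make that reasoning concrete: in classic Unix tty drivers, clists are built
from small fixed-size blocks taken from one kernel-wide free pool, so a
shortage there hurts every terminal at once. The toy model below only
illustrates that idea; it is not A/UX kernel code. The class names, the
one-character-per-block simplification, and the "leaky" line are all
assumptions made for the sketch.

```python
class CblockPool:
    """Kernel-wide pool of fixed-size character blocks (toy model)."""

    def __init__(self, nblocks):
        self.free = nblocks

    def alloc(self):
        """Take one block; return False when the pool is exhausted."""
        if self.free == 0:
            return False
        self.free -= 1
        return True

    def release(self):
        """Return one block to the pool."""
        self.free += 1


class Tty:
    """One terminal line queueing characters in blocks from the shared pool."""

    def __init__(self, pool, leaky=False):
        self.pool = pool
        self.leaky = leaky   # hypothetical bug: blocks are never given back
        self.queued = 0

    def put_char(self):
        """Queue one character (simplified to one block per character)."""
        if not self.pool.alloc():
            return False     # character is dropped: the line looks dead
        self.queued += 1
        return True

    def drain(self):
        """Consume queued characters, releasing blocks unless leaky."""
        if not self.leaky:
            for _ in range(self.queued):
                self.pool.release()
        self.queued = 0


def demo():
    """Show a leaky line starving a healthy one that shares the pool."""
    pool = CblockPool(4)
    bad, good = Tty(pool, leaky=True), Tty(pool)
    for _ in range(4):
        bad.put_char()
    bad.drain()                 # the leaky line loses all four blocks
    return good.put_char()      # the healthy line can no longer queue input
```

In this model demo() returns False: once one line has lost the pool's blocks,
an unrelated line cannot queue a single character, which is the kind of
system-wide symptom described above.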

The other reason is that this system, Panix, is (as c.u.aux readers probably
know by now :-) probably the biggest byte-moving A/UX system around, excepting
maybe the ftp machine at Apple (but that's all Ethernet). So I think we're
hitting a condition in a few days that other systems wouldn't hit for weeks,
and by then they'd probably have been rebooted for other reasons.

As I said, I'll post more once we've experimented with not using the built-in
ports at all. But in the meantime, if anybody at all has any ideas or
suggestions, no matter how off the wall, I'd like to hear them. And do any
other A/UX sites which use serial I/O ever see problems like this?

Also, while I'm criticizing A/UX's serial drivers... I have finally figured
out something that has puzzled me (and others) for months now. A few months
ago, there was some debate about whether getty would properly back off of a
serial port that cu, kermit, or uucico (or whatever) had dialed out on. I
maintained that this works fine, as did several other people. Others, however,
said that this did _not_ work and that the getty would start talking to the
comm program or uucico. I thought that they were wrong, but they're not.

The key difference on Panix seems to be whether or not we're using the built-
in serial ports. If we use the CommCard ports, getty behaves politely and
backs off. If we use the built-ins, getty and uucico clobber each other. This
suggests to me that the built-in drivers are broken. They fail in some way,
though I can't figure out how. (At first I thought that getty and uucico
only look at the lock files in /usr/spool/uucp. But perhaps they try to lock
the special files? Does this involve the driver? As you can tell I have only
the faintest idea what I'm talking about here.)
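
For readers unfamiliar with the lock-file convention mentioned here: a
dial-out program creates a file such as LCK..tty0 in the spool directory, and
other programs are expected to leave the port alone while it exists. The
sketch below is a rough illustration only. It assumes the lock file holds the
owner's PID as ASCII text (some UUCP versions store it in binary) and uses an
exclusive create, which may not be what A/UX's getty and uucico actually do.
All function names are hypothetical.

```python
import os


def lock_path(lock_dir, device):
    """Build the conventional lock file name, e.g. LCK..tty0."""
    return os.path.join(lock_dir, "LCK.." + device)


def holder_alive(path):
    """True if the lock file names a process that still exists."""
    try:
        with open(path) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False            # missing or unreadable lock: not held
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)         # signal 0 only tests for existence
    except ProcessLookupError:
        return False            # stale lock left by a dead process
    except PermissionError:
        return True             # process exists but belongs to someone else
    return True


def acquire(lock_dir, device):
    """Try to take the port; return True on success, False if it is busy."""
    path = lock_path(lock_dir, device)
    for _ in range(2):          # second pass runs after removing a stale lock
        try:
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
        except FileExistsError:
            if holder_alive(path):
                return False
            try:
                os.unlink(path) # note: racy if two programs do this at once
            except FileNotFoundError:
                pass
            continue
        with os.fdopen(fd, "w") as f:
            f.write("%10d\n" % os.getpid())
        return True
    return False
```

The lock file itself is only a convention in the filesystem; nothing in this
sketch touches the serial driver. If two programs still collide on a port
while both honor the lock, the fault would have to lie somewhere else.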

Lastly, I'm not sure what to ascribe this to, but it's probably the serial
drivers: I don't think that the timing of modem control signals is being done
right. In particular, we often see that modems will answer a call, and their
DCD lights will go on, but the corresponding getty will never wake up. On
the other end, when people log out, sometimes they'll get another getty,
instead of seeing the line hang up on them, as it should.
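
One way to see whether the driver is reporting carrier at all is to read the
modem status lines directly. This is a minimal sketch, assuming a system whose
tty driver supports the TIOCMGET ioctl. That ioctl is common on modern
Unix-like systems, but nothing in this thread establishes whether A/UX 2.0.1
offers it. The device path passed to check_port is whatever port is under
suspicion.

```python
import fcntl
import os
import struct
import termios


def modem_bits(fd):
    """Return the modem status bitmask for an open tty file descriptor."""
    raw = fcntl.ioctl(fd, termios.TIOCMGET, struct.pack("I", 0))
    return struct.unpack("I", raw)[0]


def has_carrier(bits):
    """True when the DCD (carrier detect) bit is set in the bitmask."""
    return bool(bits & termios.TIOCM_CAR)


def check_port(path):
    """Open a port without waiting for carrier and report DCD."""
    # O_NONBLOCK keeps open() from blocking until carrier appears.
    fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK | os.O_NOCTTY)
    try:
        return has_carrier(modem_bits(fd))
    finally:
        os.close(fd)
```

If the modem's DCD light is on but a check like this reports no carrier, the
signal is being lost before it reaches getty.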


All in all the behavior of serial ports under A/UX is an entirely dreadful
mess. I know that these are not perfect bug reports, and I will continue to
gather more information, but I'm hoping that someone will turn up something,
and that someone in A/UX development will say "Aha! I know where that bug
is!" and proceed to fix it. Well, it could happen...

---
Alexis Rosen
Owner/Sysadmin, PANIX Public Access Unix, NY
alexis@panix.com
{cmcl2,apple}!panix!alexis

monk@monk.ctg.Tandem.COM (Alfredo Moncayo) (06/27/91)

In article <1991Jun26.153544.1186@panix.uucp> you write:

|> Ever since the upgrade to 2.0.1, or a little after, we have experienced
|> system crashes every few days. They'd be more frequent, but we reboot
|> when we see symptoms appear, if we're around when they become evident.
|> The crashes are all clist panics. The pre-crash symptoms are loss of
|> character input or output. For long periods, terminals will go dead, only to
|> behave normally for a second or two, and then go dead again. This behavior
|> is observable on the console, in the CommandShell windows, and on built-in
|> and add-on serial ports. The machine itself seems fine though- running
|> processes behave normally, logging no errors, the file system's OK, etc.
|> 
|> ---
|> Alexis Rosen
|> Owner/Sysadmin, PANIX Public Access Unix, NY
|> alexis@panix.com
|> {cmcl2,apple}!panix!alexis


Last week I posted a plea for help on what appears to be the same problem
as yours.  After installation of 2.0.1 everything worked fine for about 2
weeks.  Then, much as you've described it, my machine just hangs periodically -
sometimes the screen blanks, other times (especially while running X11) it'll
just stop accepting input to the xterms, etc.  I'm not exactly sure it is 
related to serial ports, given that my machine is a standalone unit on our 
net, with no other regular users.  I do use a modem though.  I can't seem to
come up with anything that may be causing these system freezes - no error
logs, no cores, nothing at all!  I have found that it happens most frequently
when there has been no activity on the machine for some time (weekends and
evenings in particular).  

...maybe 2.0 wasn't so bad after all...

---
Al Moncayo				Tandem Computers, Inc.
monk@monk.ctg.tandem.com		19333 Vallco Parkway
monk@dsg.tandem.com			Cupertino, Ca 

marcelo@deadzone.uucp (Marcelo Gallardo) (06/27/91)

alexis@panix.uucp (Alexis Rosen) writes:

>And do any
>other A/UX sites which use serial I/O ever see problems like this?

	I've experienced similar problems as well (although they aren't
	happening as often now). But I never had the patience to sit and
	wait to see if the port ever came back to life even for a
	moment. 

	I had noticed that this generally happened when two processes
	tried to grab the port at the same time, or even when the port
	was being used, and your UUPOLL script (wonderful script BTW)
	tried to access the port.

	I made some minor changes to your script to abort if it found
	ANY LCK file (since I only have one line) and things seemed to
	have cleared up. 
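
The "abort on ANY LCK file" check described above might look roughly like the
sketch below. The function name is hypothetical, and the spool directory is
whatever the local UUCP uses; the actual change to the UUPOLL script is not
shown in this thread.

```python
import glob
import os


def any_lock_present(lock_dir):
    """True if any LCK.* file exists in lock_dir, whatever port it names."""
    return len(glob.glob(os.path.join(lock_dir, "LCK.*"))) > 0
```

This is only safe on a machine with a single line, as noted above: with
several ports it would refuse to dial whenever any one of them was in use.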

	Chances are, we are looking at two different problems though. Oh
	well, I tried.



-- 
Marcelo Gallardo			marcelo%deadzone@princeton.edu
Test and Evaluation Specialist		...!princeton!deadzone!marcelo
Princeton University			marcelo@sparcwood.princeton.edu
Advanced Technologies and Applications		(609) 258-5661

alexis@panix.uucp (Alexis Rosen) (06/27/91)

monk@monk.ctg.tandem.com writes:
>Last week I posted a plea for help on what appears to be the same problem
>as yours.  After installation of 2.0.1 everything worked fine for about 2
>weeks.  Then, much as you've described it, my machine just hangs periodically -
>sometimes the screen blanks, other times (especially while running X11) it'll
>just stop accepting input to the xterms, etc.  I'm not exactly sure it is 
>related to serial ports, given that my machine is a standalone unit on our 
>net, with no other regular users.  I do use a modem though.  I can't seem to
>come up with anything that may be causing these system freezes - no error
>logs, no cores, nothing at all!  I have found that it happens most frequently
>when there has been no activity on the machine for some time (weekends and
>evenings in particular).  

Well, I'd like it to be the same problem (more complaints, quicker fixes :-)
but I suspect they're two different things entirely. But it's easy enough to
figure out. When this problem happens to us, the machine isn't crashed! _Only_
char I/O is messed up. So the easiest way to check this is to install a cron
job that touches a file every 5 or 10 minutes. Then, when you notice that
your machine is hung, wait a bit, and after you reboot, look at the file
(BEFORE the cron job touches it again!). If its modification time is later
than the apparent lockup, you may have my problem. If not, your system is
really crashing, and
you probably have a different problem. Did you ever try rebuilding your
kernel, just for the heck of it?
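
The heartbeat test can be sketched as follows. This is a minimal illustration,
not a tested tool: the file path and the two result labels are assumptions,
and touch() stands in for whatever the cron job runs every 5 or 10 minutes.

```python
import os

# Hypothetical location; pick a path that is not cleared at boot.
HEARTBEAT = "/var/tmp/heartbeat"


def touch(path, when=None):
    """Create the file if needed and set its modification time.

    With when=None the current time is used, as the cron job would do.
    """
    with open(path, "a"):
        pass
    os.utime(path, None if when is None else (when, when))


def classify(path, hang_noticed_at):
    """Compare the last heartbeat with the time the hang was noticed.

    Both times are seconds since the epoch.
    """
    last_beat = os.path.getmtime(path)
    if last_beat > hang_noticed_at:
        return "char-io-only"   # cron kept running: only terminal I/O died
    return "real-crash"         # heartbeat stopped: the whole system hung
```

After the reboot, and before cron runs again, call classify(HEARTBEAT, t)
with t set to the time you first saw the machine hang.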

Does any of your software make heavy use of clists? How do you use the modem?
Either of these might be relevant.

---
Alexis Rosen
Owner/Sysadmin, PANIX Public Access Unix, NY
alexis@panix.com
{cmcl2,apple}!panix!alexis

alexis@panix.uucp (Alexis Rosen) (06/28/91)

marcelo@deadzone.uucp (Marcelo Gallardo) writes:
>alexis@panix.uucp (Alexis Rosen) writes:
>>And do any
>>other A/UX sites which use serial I/O ever see problems like this?
>
>	I've experienced similar problems as well (although they aren't
>	happening as often now). But I never had the patience to sit and
>	wait to see if the port ever came back to life even for a
>	moment. 

The problems start small, at least sometimes. So for a while, you wonder
whether you're just not hitting the keyboard keys hard enough.

>	I had noticed that this generally happened when two processes
>	tried to grab the port at the same time, or even when the port
>	was being used, and your UUPOLL script (wonderful script BTW)
>	tried to access the port.

(Thanks!) Could you go into more detail on this? The whole point of that
script was to _not_ do any damage if something else owns the port that the
script is trying to use. If that weren't the case here, this system would
be completely unusable- we're dialing in and out all the time.

---
Alexis Rosen
Owner/Sysadmin, PANIX Public Access Unix, NY
alexis@panix.com
{cmcl2,apple}!panix!alexis

marcelo@deadzone.uucp (Marcelo Gallardo) (07/01/91)

alexis@panix.uucp (Alexis Rosen) writes:

>>	I had noticed that this generally happened when two processes
>>	tried to grab the port at the same time, or even when the port
>>	was being used, and your UUPOLL script (wonderful script BTW)
>>	tried to access the port.

>(Thanks!) Could you go into more detail on this? The whole point of that
>script was to _not_ do any damage if something else owns the port that the
>script is trying to use. If that weren't the case here, this system would
>be completely unusable- we're dialing in and out all the time.
	
	When I would grab the tty1 port (to call out via cu), and UUPOLL
	would try to grab the port, it would sometimes freeze up the
	system. It was as if it didn't see the LCK file. I hardcoded the
	LCK check into the script, and I haven't had a problem since.

	I don't know why it was "ignoring" the LCK as you had it
	written, since it obviously seems to work for you, but it didn't
	here. It would then try to set the port, and that's when things
	got messy.

-- 
Marcelo Gallardo			marcelo%deadzone@princeton.edu
Test and Evaluation Specialist		...!princeton!deadzone!marcelo
Princeton University			marcelo@sparcwood.princeton.edu
Advanced Technologies and Applications		(609) 258-5661