alexis@panix.uucp (Alexis Rosen) (06/26/91)
(Since some serial guru outside of the A/UX world may have relevant info,
I'm posting this to comp.unix.questions, but followups to comp.unix.aux
only.)

I've been having major grief with A/UX serial port behavior for quite a
while now. Unlike previous problems (which remain unresolved), these latest
problems make the entire machine utterly unusable when they occur. I may
not have all of the relevant information I'd like, but the seriousness of
this matter is prompting me to post early. I will follow up with more info
as it becomes available.

The machine is a Mac IIx with 8MB RAM, running A/UX 2.0.1. It has 4 disks
totalling about 1200MB and two CommCard 4-port serial boards.

Ever since the upgrade to 2.0.1, or a little after, we have experienced
system crashes every few days. They'd be more frequent, but we reboot
when we see symptoms appear, if we're around when they become evident.
The crashes are all clist panics. The pre-crash symptoms are loss of
character input or output. For long periods, terminals will go dead, only to
behave normally for a second or two, and then go dead again. This behavior
is observable on the console, in the CommandShell windows, and on built-in
and add-on serial ports. The machine itself seems fine though- running
processes behave normally, logging no errors, the file system's OK, etc.
Of course uucico doesn't like things because it can't talk anymore.

We are pretty sure this has nothing to do with the Mac environment. We
rebooted a few days ago and never invoked the MacOS, even to run
/etc/Login. Yet we just had serious "character disease" occur again fifteen
minutes ago. (I managed to reboot, after trying to type "/etc/reboot" for
about six minutes.)

The big question is whether or not the add-on serial ports are causing
problems. But in a way, it's irrelevant. They worked fine in 2.0 (at least
in this respect), and if they're broken now then A/UX is broken.
On the other hand, it may be that the serial drivers for the built-in ports
are bad in 2.0.1. In that case, we'll know in a few days, because tonight
we are moving all modems off the built-in ports onto the cards. (For those
who might ask: the CommCard drivers make no use of anything past u-dot.)

As for why I suspect the serial ports- well, first of all, clists are used
by the serial drivers. And not much else. And the "character disease" seems
to be exactly what you'd see if you had major clist problems. (But I'm not
an expert here and am open to additional or contrary information.) The
other reason is that this system, Panix, is (as c.u.aux readers probably
know by now :-) probably the biggest byte-moving A/UX system around,
excepting maybe the ftp machine at Apple (but that's all Ethernet). So I
think we're hitting a condition in a few days that other systems wouldn't
hit for weeks, and by then they'd probably have been rebooted for other
reasons.

As I said, I'll post more once we've experimented with not using the
built-in ports at all. But in the meantime, if anybody at all has any ideas
or suggestions, no matter how off the wall, I'd like to hear them. And do any
other A/UX sites which use serial I/O ever see problems like this?

Also, while I'm criticizing A/UX's serial drivers... I have finally figured
out something that has puzzled me (and others) for months now. A few months
ago, there was some debate about whether getty would properly back off of a
serial port that cu, kermit, or uucico (or whatever) had dialed out on. I
maintained that this works fine, as did several other people. Others,
however, said that this did _not_ work and that the getty would start
talking to the comm program or uucico. I thought that they were wrong, but
they're not.

The key difference on Panix seems to be whether or not we're using the
built-in serial ports. If we use the CommCard ports, getty behaves politely
and backs off.
If we use the built-ins, getty and uucico clobber each other.

This suggests to me that the built-in drivers are broken. They fail in some
way, though I can't figure out how. (At first I thought that getty and
uucico only look at the lock files in /usr/spool/uucp. But perhaps they try
to lock the special files? Does this involve the driver? As you can tell I
have only the faintest idea what I'm talking about here.)

Lastly, I'm not sure what to ascribe this to, but it's probably the serial
drivers: I don't think that the timing of modem control signals is being
done right. In particular, we often see that modems will answer a call, and
their DCD lights will go on, but the corresponding getty will never wake
up. On the other end, when people log out, sometimes they'll get another
getty, instead of seeing the line hang up on them, as it should.

All in all the behavior of serial ports under A/UX is an entirely dreadful
mess. I know that these are not perfect bug reports, and I will continue to
gather more information, but I'm hoping that someone will turn up
something, and that someone in A/UX development will say "Aha! I know where
that bug is!" and proceed to fix it.

Well, it could happen...

---
Alexis Rosen
Owner/Sysadmin, PANIX Public Access Unix, NY
alexis@panix.com
{cmcl2,apple}!panix!alexis
monk@monk.ctg.Tandem.COM (Alfredo Moncayo) (06/27/91)
In article <1991Jun26.153544.1186@panix.uucp> you write:
|> Ever since the upgrade to 2.0.1, or a little after, we have experienced
|> system crashes every few days. They'd be more frequent, but we reboot
|> when we see symptoms appear, if we're around when they become evident.
|> The crashes are all clist panics. The pre-crash symptoms are loss of
|> character input or output. For long periods, terminals will go dead, only to
|> behave normally for a second or two, and then go dead again. This behavior
|> is observable on the console, in the CommandShell windows, and on built-in
|> and add-on serial ports. The machine itself seems fine though- running
|> processes behave normally, logging no errors, the file system's OK, etc.
|>
|> ---
|> Alexis Rosen
|> Owner/Sysadmin, PANIX Public Access Unix, NY
|> alexis@panix.com
|> {cmcl2,apple}!panix!alexis

Last week I posted a plea for help on what appears to be the same problem
as yours. After installation of 2.0.1 everything worked fine for about 2
weeks. Then, much as you've described it, my machine just hangs periodically -
sometimes the screen blanks, other times (especially while running X11) it'll
just stop accepting input to the xterms, etc. I'm not exactly sure it is
related to serial ports, given that my machine is a standalone unit on our
net, with no other regular users. I do use a modem though. I can't seem to
come up with anything that may be causing these system freezes - no error
logs, no cores, nothing at all! I have found that it happens most frequently
when there has been no activity on the machine for some time (weekends and
evenings in particular).

...maybe 2.0 wasn't so bad after all...

---
Al Moncayo                    Tandem Computers, Inc.
monk@monk.ctg.tandem.com      19333 Vallco Parkway
monk@dsg.tandem.com           Cupertino, Ca
marcelo@deadzone.uucp (Marcelo Gallardo) (06/27/91)
alexis@panix.uucp (Alexis Rosen) writes:

>And do any
>other A/UX sites which use serial I/O ever see problems like this?

 I've experienced similar problems as well (although they aren't
 happening as often now). But I never had the patience to sit and
 wait to see if the port ever came back to life even for a
 moment.

 I had noticed that this generally happened when two processes
 tried to grab the port at the same time, or even when the port
 was being used, and your UUPOLL script (wonderful script BTW)
 tried to access the port. I did some minor changing to your
 script to abort if it found ANY LCK file (since I only have one
 line) and things seemed to have cleared up.

 Chances are, we are looking at two different problems though.
 Oh well, I tried.

--
Marcelo Gallardo                         marcelo%deadzone@princeton.edu
Test and Evaluation Specialist           ...!princeton!deadzone!marcelo
Princeton University                     marcelo@sparcwood.princeton.edu
Advanced Technologies and Applications   (609) 258-5661
alexis@panix.uucp (Alexis Rosen) (06/27/91)
monk@monk.ctg.tandem.com writes:
>Last week I posted a plea for help on what appears to be the same problem
>as yours. After installation of 2.0.1 everything worked fine for about 2
>weeks. Then, much as you've described it, my machine just hangs periodically -
>sometimes the screen blanks, other times (especially while running X11) it'll
>just stop accepting input to the xterms, etc. I'm not exactly sure it is
>related to serial ports, given that my machine is a standalone unit on our
>net, with no other regular users. I do use a modem though. I can't seem to
>come up with anything that may be causing these system freezes - no error
>logs, no cores, nothing at all! I have found that it happens most frequently
>when there has been no activity on the machine for some time (weekends and
>evenings in particular).

Well, I'd like it to be the same problem (more complaints, quicker fixes :-)
but I suspect they're two different things entirely. But it's easy enough to
figure out.

When this problem happens to us, the machine isn't crashed! _Only_ char I/O
is messed up. So the easiest way to check this is to install a cron job that
touches a file every 5 or 10 minutes. Then, when you notice that your
machine is hung, wait a bit, and after you reboot, look at the file (BEFORE
the cron job touches it again!). If the last time is after the apparent
lockup, you may have my problem. If not, your system is really crashing, and
you probably have a different problem.

Did you ever try rebuilding your kernel, just for the heck of it? Does any
of your software make heavy use of clists? How do you use the modem? Either
of these might be relevant.

---
Alexis Rosen
Owner/Sysadmin, PANIX Public Access Unix, NY
alexis@panix.com
{cmcl2,apple}!panix!alexis
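
The heartbeat test described above can be sketched roughly as follows. This is only an illustration in Python, not anything posted in the thread: the function names, the verdict strings, and the heartbeat file location are all hypothetical. The idea is that a periodic job updates a file's timestamp, and after the reboot you compare that timestamp with the time you first noticed the hang.

```python
import os
import tempfile
import time


def touch_heartbeat(path):
    """Update the heartbeat file's modification time, creating it if needed.

    This is what the periodic (cron-style) job would do every few minutes.
    """
    with open(path, "a"):
        pass
    os.utime(path, None)  # None means "set to the current time"


def classify_hang(heartbeat_mtime, hang_noticed_at):
    """Compare the last heartbeat with the time the hang was noticed.

    Both arguments are seconds since the epoch. The verdict strings are
    made-up labels for the two cases described in the post.
    """
    if heartbeat_mtime > hang_noticed_at:
        # The job kept running after the terminals went dead, so the
        # kernel was still scheduling processes: only character I/O was hit.
        return "char-io-only"
    # No heartbeat after the hang: the whole system had really stopped.
    return "full-crash"


if __name__ == "__main__":
    # Demo with a temporary file; a real setup would use a fixed path.
    demo_dir = tempfile.mkdtemp()
    heartbeat = os.path.join(demo_dir, "heartbeat")
    noticed = time.time() - 60  # pretend the hang was noticed a minute ago
    touch_heartbeat(heartbeat)
    print(classify_hang(os.path.getmtime(heartbeat), noticed))
```

Note that the comparison is only meaningful if you wait at least one full job interval after noticing the hang, and read the timestamp before the job runs again after the reboot.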
alexis@panix.uucp (Alexis Rosen) (06/28/91)
marcelo@deadzone.uucp (Marcelo Gallardo) writes:
>alexis@panix.uucp (Alexis Rosen) writes:
>>And do any
>>other A/UX sites which use serial I/O ever see problems like this?
>
> I've experienced similar problems as well (although they aren't
> happening as often now). But I never had the patience to sit and
> wait to see if the port ever came back to life even for a
> moment.

The problems start small, at least sometimes. So for a while, you wonder
whether you're just not hitting the keyboard keys hard enough.

> I had noticed that this generally happened when two processes
> tried to grab the port at the same time, or even when the port
> was being used, and your UUPOLL script (wonderful script BTW)
> tried to access the port.

(Thanks!) Could you go into more detail on this? The whole point of that
script was to _not_ do any damage if something else owns the port that the
script is trying to use. If that weren't the case here, this system would
be completely unusable- we're dialing in and out all the time.

---
Alexis Rosen
Owner/Sysadmin, PANIX Public Access Unix, NY
alexis@panix.com
{cmcl2,apple}!panix!alexis
marcelo@deadzone.uucp (Marcelo Gallardo) (07/01/91)
alexis@panix.uucp (Alexis Rosen) writes:
>> I had noticed that this generally happened when two processes
>> tried to grab the port at the same time, or even when the port
>> was being used, and your UUPOLL script (wonderful script BTW)
>> tried to access the port.

>(Thanks!) Could you go into more detail on this? The whole point of that
>script was to _not_ do any damage if something else owns the port that the
>script is trying to use. If that weren't the case here, this system would
>be completely unusable- we're dialing in and out all the time.

 When I would grab the tty1 port (to call out via cu), and UUPOLL
 would try to grab the port, it would sometimes freeze up the
 system. It was as if it didn't see the LCK file. I hardcoded the
 LCK check into the script, and I haven't had a problem since. I
 don't know why it was "ignoring" the LCK as you had it written,
 since it obviously seems to work for you, but it didn't here. It
 would then try to set the port, and that's when things got messy.

--
Marcelo Gallardo                         marcelo%deadzone@princeton.edu
Test and Evaluation Specialist           ...!princeton!deadzone!marcelo
Princeton University                     marcelo@sparcwood.princeton.edu
Advanced Technologies and Applications   (609) 258-5661
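
The "abort if ANY LCK file exists" check mentioned in this thread might look something like the sketch below. It is written in Python purely for illustration and is not the UUPOLL script or the modified version of it. The function names are hypothetical, and the assumptions that lock files live in a single spool directory and have names beginning with "LCK" are taken loosely from the posts, not from any verified A/UX documentation.

```python
import glob
import os
import tempfile


def find_locks(spool_dir):
    """Return a sorted list of lock files in spool_dir.

    Assumption: any file whose name starts with "LCK" counts as a lock,
    regardless of which port or system it refers to.
    """
    return sorted(glob.glob(os.path.join(spool_dir, "LCK*")))


def safe_to_poll(spool_dir):
    """True only when no lock file of any kind is present.

    This is the blunt single-line policy: if anything at all is locked,
    do not touch the port.
    """
    return not find_locks(spool_dir)


if __name__ == "__main__":
    # Demo in a temporary directory standing in for the UUCP spool area.
    spool = tempfile.mkdtemp()
    print(safe_to_poll(spool))   # nothing locked yet
    # Hypothetical lock name, created by hand for the demo.
    open(os.path.join(spool, "LCK..tty1"), "w").close()
    print(safe_to_poll(spool))   # a lock is now present
    print(find_locks(spool))
```

A check like this is not atomic: another process can create its lock between the check and the moment the port is opened. It only narrows the window in which two programs can grab the same line, and it is only sensible on a machine with a single line, since a lock on any port blocks polling on all of them.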