[comp.sys.apollo] getty problem at sr10.1

lray@CIVILGATE.CE.UIUC.EDU (Leland Ray) (11/12/89)

It all started when I moved the node to SR10.1. Until that time, it had been
a well behaved DN4000. Connected to the node were two serial lines, both
connecting to PCs in nearby offices.

Several days after the node was converted, the owner began to complain. The
node was slow, so she told me, and the disk was quickly filling up.

Upon checking the node, I noticed that there was no free disk space, and both
gettys were in an infinite loop. Rebooting the node recovered 70M in disk space.
"Gee," said I, "this looks like a problem with the OS. Perhaps the 800 number
could help me out."

A call to the 800 number failed to resolve the problem. It seems nobody at
Apollo had a problem with runaway gettys. The only possible suggestion was that
I'd created the login_log file and that was growing without bounds. Of course,
if it was a real file, rebooting the node would not recover that space -- I'd
simply find a huge object on the disk I could delete.

Well, if getty doesn't work, perhaps dropping back on the old way -- the SR9 way
-- would solve the problem. siomonit came up easily, and did not destroy the
cpu or the disk as did getty. However, it did create many processes called
"level2" and fill up the process table, prohibiting any use of the node until
reboot. So it seems I was stuck with getty.

The next thing I did was to take advantage of the fact that my university is
blessed with a Unix source license. I got the source to getty and spent a long
time looking at it. I decided that the best way to tackle the problem would be
to port the code to SR10.1 and use syslog calls to log significant program
events. And this is what showed me the problem.
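
The kind of instrumentation described above might look roughly like the
following. This is only a sketch in present-day C, not the code that was
actually added to getty: the helper names (format_event, log_event), the
message layout, and the choice of the LOG_DAEMON facility are all assumptions
made for illustration. Only openlog() and syslog() are the standard calls.

```c
#include <assert.h>
#include <stdio.h>
#include <syslog.h>

/* Hypothetical helper: build a one-line description of a getty event.
 * Returns what snprintf() returns (the length the full message needs). */
static int format_event(char *out, size_t outsz, const char *tty,
                        const char *event, unsigned long nchars)
{
    assert(out != NULL && tty != NULL && event != NULL);
    return snprintf(out, outsz, "%s: %s (chars read: %lu)",
                    tty, event, nchars);
}

/* Hypothetical helper: call once at startup so every message is tagged
 * with the program name and process id. The facility is an assumption. */
static void log_setup(void)
{
    openlog("getty", LOG_PID, LOG_DAEMON);
}

/* Hypothetical helper: send one event to syslog at debug priority. */
static void log_event(const char *tty, const char *event,
                      unsigned long nchars)
{
    char msg[128];

    format_event(msg, sizeof msg, tty, event, nchars);
    syslog(LOG_DEBUG, "%s", msg);   /* "%s" keeps the text from being
                                       treated as a format string */
}
```

Calling log_setup() once and then log_event() at points such as "line opened",
"name complete", and "name too long" would leave a trail in the system log
showing how many characters arrive before each getty exits.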

It seems that the serial lines, for as yet unknown reasons, were flooding the
line with large numbers of garbage characters. Most of them were ^? (Ascii 0xff)
with a few o's (Ascii 0x7f) thrown in for good measure.

What was happening was this: getty accepts only as many characters of the login
name as are allowed, and after that, it terminates and passes the string to
login even if the user never presses a return key. The Apollo serial drivers
terminate the process silently after 255 characters (try holding down a
repeating character while logged into an Apollo's serial line), which init
faithfully notices and calls up another getty. The time spent in initializing
the line, getting the environment, looking it up in gettytab, etc. while bogus
characters are being dropped in the line takes up 99% of a DN4000.

The fix I've made to my local copy of getty is the following: If the length of
the login name exceeds the maximum number of permissible characters, drop all
additional characters on the floor except the erase character, the kill
character, and the EOF character. The loop of getty that reads in incoming
characters and drops them takes up at most 2% of the DN4000's time.
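
For readers without the source, the rule above can be sketched as a small
filter. This is an illustration only, not the actual patch: the 8-character
limit, the control-character values (backspace, ^U, ^D), and the names
name_buf and name_feed are assumptions. The sketch also assumes that a
carriage return or newline still ends the name, which is not spelled out
above.

```c
#include <assert.h>
#include <stddef.h>

#define NAME_MAX_LEN 8     /* assumed limit on login name length */
#define CH_ERASE     0x08  /* assumed erase character (backspace) */
#define CH_KILL      0x15  /* assumed kill character (^U) */
#define CH_EOF       0x04  /* assumed EOF character (^D) */

enum name_status { NAME_MORE, NAME_DONE, NAME_EOF };

struct name_buf {
    char   text[NAME_MAX_LEN + 1];  /* room for the terminating NUL */
    size_t len;                     /* characters stored so far */
};

/* Feed one input character to the name buffer. Once the buffer is full,
 * ordinary characters are discarded instead of ending the read, so a
 * flood of line noise cannot force getty to hand off a bogus name. */
static enum name_status name_feed(struct name_buf *nb, int c)
{
    assert(nb != NULL);

    if (c == CH_EOF)
        return NAME_EOF;
    if (c == '\r' || c == '\n') {       /* assumed: return ends the name */
        nb->text[nb->len] = '\0';
        return NAME_DONE;
    }
    if (c == CH_ERASE) {                /* remove the last character */
        if (nb->len > 0)
            nb->len--;
        return NAME_MORE;
    }
    if (c == CH_KILL) {                 /* discard the whole line */
        nb->len = 0;
        return NAME_MORE;
    }
    if (nb->len >= NAME_MAX_LEN)        /* full: drop it on the floor */
        return NAME_MORE;

    nb->text[nb->len++] = (char)c;
    return NAME_MORE;
}
```

A read loop would call name_feed() for each byte from the line and stop only
on NAME_DONE or NAME_EOF. However many garbage characters arrive, the buffer
never grows past NAME_MAX_LEN and the process keeps reading instead of
exiting and being respawned by init.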

All told, here are my thoughts regarding this problem:

   1. The Unix community (as this is a general Unix bug, and not an Apollo
      specific one) should modify more programs like getty to give debug
      output so these kind of problems can be tracked without having access
      to the source.

   2. I've reported the serial line 256th character dropout problem to Apollo,
      and I'm told it will be "fixed in a future software release."

   3. The problem with the process table filling up with level2's is a serious
      one, and should be fixed. I suspect this is incomplete process death;
      however, I don't know enough about it to precisely duplicate it. In
      addition, the DN10000 (SR10.1.p) also seems to accumulate level2's.

--------------------------------------------------------------------------------
Just spendin' my days,                  Leland Ray
                                        Systems Administrator
  Soakin' in them cathode rays.         UIUC - Dept. Civil Engineering
                                        (217) 333-3821

jwright@atanasoff.cs.iastate.edu (Jim Wright) (11/12/89)

lray@CIVILGATE.CE.UIUC.EDU (Leland Ray) writes:
|    1. The Unix community (as this is a general Unix bug, and not an Apollo
|       specific one) should modify more programs like getty to give debug
|       output so these kind of problems can be tracked without having access
|       to the source.

Yeah, let's start by putting a debug mode into sendmail.

-- 
Jim Wright
jwright@atanasoff.cs.iastate.edu

weber_w@apollo.HP.COM (Walt Weber) (11/15/89)

In article <8911111819.AA02201@civilgate.ce.uiuc.edu> lray@CIVILGATE.CE.UIUC.EDU (Leland Ray) writes:
>
>It all started when I moved the node to SR10.1.
>
>Upon checking the node, I noticed that there was no free disk space, and both
>gettys were in an infinite loop. Rebooting the node recovered 70M in disk space.
>
>It seems that the serial lines, for as yet unknown reasons, were flooding the
> line with large numbers of garbage characters. Most of them were ^? 
> (Ascii 0xff) with a few o's (Ascii 0x7f) thrown in for good measure.
>
Leland -

First of all, it looks like you addressed the symptoms of the problem with your
mods to getty (like you said, good thing you were "blessed with the source").

It sure looks like the underlying CAUSE is yet to be addressed, however, since
you are still getting what appears to be line noise on the serial port.  The
behavior is similar to that observed when a modem is attached for dial-out, but
is not configured to eliminate result codes (like any Hayes compatible, for
example).  Thus getty sends a prompt to the modem, the modem sends back either
OK or ERROR (which getty interprets as a login user name & exec's login), login
prompts for a password and exits on the timeout signal, which respawns getty,
and the cycle repeats.

Finding and eliminating the cause of the line noise (0xff & 0x7f are some of
the more common "noise indicators") should enable you to use the standard
getty without problem.  This is in no way meant to indicate that your getty is
"inferior", it's just less hassle if you don't have to install local mods on
every release, right?

I would also expect that the line is noisy when under use by the PC operator.
If she isn't seeing any noise, I would **suspect** that the rs232 cable is 
being disconnected from the PC and thus acting as an antenna to induce signals
on the cable - this is a very common scenario in timesharing mini systems, and
can generally be addressed by installing an AB switch at the PC end (I think).

Good luck in killing the cause, once you find it.

...walt...

Walt Weber               Hewlett Packard NARC @ Apollo Systems Division
(508) 256-6600 x8315     People's Republic of Massachusetts
-The views expressed herein are personal, and not binding on ANYONE-