[comp.sys.apollo] getty problem at 10.1

lray@CIVILGATE.CE.UIUC.EDU (Leland Ray) (11/11/89)

It all started when I moved the node to SR10.1. Until that time, it had been
a well behaved DN4000. Connected to the node were two serial lines, both
connecting to PCs in nearby offices.

Several days after the node was converted, the owner began to complain. The
node was slow, so she told me, and the disk was quickly filling up.

Upon checking the node, I noticed that there was no free disk space, and both
gettys were in an infinite loop. Rebooting the node recovered 70M in disk space.
"Gee," said I, "this looks like a problem with the OS. Perhaps the 800 number
could help me out."

A call to the 800 number failed to resolve the problem. It seems nobody at Apollo
had a problem with runaway gettys. The only possible suggestion was that I'd
created the login_log file and that was growing without bounds. Of course, if it
was a real file, rebooting the node would not recover that space -- I'd simply
find a huge object on the disk I could delete.

Well, if getty doesn't work, perhaps falling back on the old way -- the SR9 way
-- would solve the problem. siomonit came up easily, and did not destroy the
cpu or the disk as did getty. However, it did create many processes called "level2"
and fill up the process table, prohibiting any use of the node until reboot. So
it seems I was stuck with getty.

The next thing I did was to take advantage of the fact that my university is
blessed with a Unix source license. I got the source to getty and spent a long
time looking at it. I decided that the best way to tackle the problem would be to
port the code to SR10.1 and use syslog calls to log significant program events.
And this is what showed me the problem.
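
A minimal sketch of that kind of instrumentation follows. It is written in
modern C, not taken from the licensed getty source, and render_byte() and
log_name_event() are hypothetical names. It assumes the standard syslog(3)
interface and renders each input byte in a printable form so that garbage
characters show up clearly in the log.

```c
#include <assert.h>
#include <stdio.h>
#include <syslog.h>

/* Hypothetical helper: write a log-safe rendering of one input byte into
 * out. Control characters become ^X, DEL becomes ^?, and bytes with the
 * high bit set become \xNN. Returns the number of characters written. */
static int
render_byte(int c, char *out, size_t outlen)
{
    unsigned char b = (unsigned char)c;

    assert(out != NULL && outlen >= 5);   /* longest form is "\xNN" + NUL */
    if (b == 0x7f)
        return snprintf(out, outlen, "^?");
    if (b < 0x20)
        return snprintf(out, outlen, "^%c", b + '@');
    if (b < 0x7f)
        return snprintf(out, outlen, "%c", b);
    return snprintf(out, outlen, "\\x%02x", b);
}

/* Hypothetical helper: log one "character received" event for a line. */
void
log_name_event(const char *line, int c, size_t name_len)
{
    char shown[8];

    render_byte(c, shown, sizeof shown);
    syslog(LOG_DEBUG, "%s: got %s, name length now %lu",
           line, shown, (unsigned long)name_len);
}
```

A program using this would call openlog("getty", LOG_PID, LOG_DAEMON) once at
startup and then log_name_event() from its input loop. Where the messages end
up depends on the local syslog configuration.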

It seems that the serial lines, for as yet unknown reasons, were being flooded
with large numbers of garbage characters. Most of them were ^? (ASCII 0xff) with
a few o's (ASCII 0x7f) thrown in for good measure.

What was happening was this: getty takes as the login name only as many
characters as are allowed, and after that, it terminates and passes the
string to login even if the user never presses a return key. The Apollo serial
drivers terminate the process silently after 255 characters (try holding down a
repeating character while logged into an Apollo's serial line), which init
faithfully notices and calls up another getty. The time spent in initializing the
line, getting the environment, looking it up in gettytab, etc. while bogus
characters are being dropped in the line takes up 99% of a DN4000.

The fix I've made to my local copy of getty is the following: If the length of the
login name exceeds the maximum number of permissible characters, drop all
additional characters on the floor except the erase character, the kill character,
and the EOF character. The loop of getty that reads in incoming characters and
drops them takes up at most 2% of the DN4000's time.
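
The rule above can be sketched roughly as follows. This is an illustration in
modern C, not the actual patch: NAME_MAX_LEN, struct name_buf and feed_char()
are hypothetical names, the 16-character limit is an arbitrary stand-in, and
the sketch assumes that a carriage return or newline still ends the name.

```c
#include <assert.h>
#include <stddef.h>

enum { NAME_MAX_LEN = 16 };     /* assumed limit, for illustration only */

struct name_buf {
    char   text[NAME_MAX_LEN + 1];
    size_t len;
};

enum feed_result { FEED_MORE, FEED_DONE, FEED_EOF };

/* Feed one input character into the login-name buffer. Once the buffer
 * is full, ordinary characters are dropped, but the erase, kill and EOF
 * characters (and the line terminator) are still honored. */
static enum feed_result
feed_char(struct name_buf *nb, int c, int erase_ch, int kill_ch, int eof_ch)
{
    assert(nb != NULL && nb->len <= NAME_MAX_LEN);

    if (c == eof_ch)
        return FEED_EOF;
    if (c == '\r' || c == '\n') {
        nb->text[nb->len] = '\0';   /* name complete, hand it to login */
        return FEED_DONE;
    }
    if (c == erase_ch) {
        if (nb->len > 0)
            nb->len--;              /* rub out the last character */
        return FEED_MORE;
    }
    if (c == kill_ch) {
        nb->len = 0;                /* discard the whole name */
        return FEED_MORE;
    }
    if (nb->len < NAME_MAX_LEN)
        nb->text[nb->len++] = (char)c;
    /* else: buffer full, drop the character on the floor */
    return FEED_MORE;
}
```

The point of the design is that an overlong name never causes an exit, so a
flood of garbage costs one cheap comparison per character instead of a full
getty restart.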

All told, here are my thoughts regarding this problem:

   1. The Unix community (as this is a general Unix bug, and not an Apollo
      specific one) should modify more programs like getty to give debug output so
      these kinds of problems can be tracked without having access to the source.

   2. I've reported the serial line 256th character dropout problem to Apollo,
      and I'm told it will be "fixed in a future software release."

   3. The problem with the process table filling up with level2's is a serious
      one, and should be fixed. I suspect this is incomplete process death;
      however, I don't know enough about it to precisely duplicate it. In
      addition, the DN10000 (SR10.1.p) also seems to accumulate level2's.

----------------------------------------------------------------------------------
Just spendin' my days,                  Leland Ray
                                        Systems Administrator
  Soakin' in them cathode rays.         UIUC - Dept. Civil Engineering
                                        (217) 333-3821