lray@CIVILGATE.CE.UIUC.EDU (Leland Ray) (11/11/89)
It all started when I moved the node to SR10.1. Until that time, it had been a well-behaved DN4000. Connected to the node were two serial lines, both connecting to PCs in nearby offices.

Several days after the node was converted, the owner began to complain. The node was slow, so she told me, and the disk was quickly filling up. Upon checking the node, I noticed that there was no free disk space, and both gettys were in an infinite loop. Rebooting the node recovered 70M in disk space.

"Gee," said I, "this looks like a problem with the OS. Perhaps the 800 number could help me out." A call to the 800 number failed to resolve the problem. It seems nobody at Apollo had a problem with runaway gettys. The only suggestion was that I'd created the login_log file and that it was growing without bound. Of course, if it were a real file, rebooting the node would not recover that space -- I'd simply find a huge object on the disk I could delete.

Well, if getty doesn't work, perhaps dropping back to the old way -- the SR9 way -- would solve the problem. siomonit came up easily, and did not destroy the cpu or the disk as getty did. However, it did create many processes called "level2" and fill up the process table, prohibiting any use of the node until reboot. So it seems I was stuck with getty.

The next thing I did was to take advantage of the fact that my university is blessed with a Unix source license. I got the source to getty and spent a long time looking at it. I decided that the best way to tackle the problem would be to port the code to SR10.1 and use syslog calls to log significant program events.

And this is what showed me the problem. It seems that the serial lines, for as yet unknown reasons, were being flooded with large numbers of garbage characters. Most of them were ^? (Ascii 0xff) with a few o's (Ascii 0x7f) thrown in for good measure.
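For anyone wanting to try the same kind of instrumentation, here is a rough sketch of logging each incoming character through syslog(3). It is not the actual getty source; the helper names (start_logging, describe_char, log_input_char), the message format, and the choice of the LOG_DAEMON facility are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <syslog.h>

/* Hypothetical helper: open the log once at program startup.
 * The "getty" tag and LOG_DAEMON facility are illustrative choices. */
void start_logging(void)
{
    openlog("getty", LOG_PID, LOG_DAEMON);
}

/* Hypothetical helper: render one input byte in a log-friendly form.
 * Printable ASCII is quoted as-is; anything else is shown in hex. */
void describe_char(unsigned char c, char *buf, size_t buflen)
{
    assert(buf != NULL && buflen >= 5);   /* room for "0xNN" plus NUL */
    if (c >= 0x20 && c < 0x7f)
        snprintf(buf, buflen, "'%c'", c);
    else
        snprintf(buf, buflen, "0x%02x", (unsigned int)c);
}

/* Hypothetical helper: log one character read while the login name
 * is being collected, along with its position in the input. */
void log_input_char(const char *tty, unsigned char c, int count)
{
    char desc[8];

    describe_char(c, desc, sizeof desc);
    syslog(LOG_DEBUG, "%s: read %s (input char %d)", tty, desc, count);
}
```

Calling something like log_input_char from inside the loop that reads the login name should make a flood of non-printing bytes show up in the system log right away, without needing a debugger on the serial line.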
What was happening was this: getty accepts only as many characters of the login name as are allowed, and after that, it terminates and passes the string to login even if the user never presses a return key. The Apollo serial drivers terminate the process silently after 255 characters (try holding down a repeating character while logged into an Apollo's serial line); init faithfully notices this and starts another getty. The time spent initializing the line, getting the environment, looking it up in gettytab, etc., while bogus characters keep arriving on the line, takes up 99% of a DN4000.

The fix I've made to my local copy of getty is the following: if the length of the login name exceeds the maximum number of permissible characters, drop all additional characters on the floor except the erase character, the kill character, and the EOF character. The loop of getty that reads in incoming characters and drops them takes up at most 2% of the DN4000's time.

All told, here are my thoughts regarding this problem:

1. The Unix community (as this is a general Unix bug, and not an Apollo-specific one) should modify more programs like getty to give debug output, so these kinds of problems can be tracked without having access to the source.

2. I've reported the serial line 256th character dropout problem to Apollo, and I'm told it will be "fixed in a future software release."

3. The problem with the process table filling up with level2's is a serious one, and should be fixed. I suspect this is incomplete process death; however, I don't know enough about it to duplicate it precisely. In addition, the DN10000 (SR10.1.p) also seems to accumulate level2's.

----------------------------------------------------------------------------------
Just spendin' my days,           Leland Ray   Systems Administrator
Soakin' in them cathode rays.    UIUC - Dept. Civil Engineering   (217) 333-3821