lray@CIVILGATE.CE.UIUC.EDU (Leland Ray) (11/12/89)
It all started when I moved the node to SR10.1. Until that time, it had been a well behaved DN4000. Connected to the node were two serial lines, both connecting to PCs in nearby offices. Several days after the node was converted, the owner began to complain. The node was slow, so she told me, and the disk was quickly filling up. Upon checking the node, I noticed that there was no free disk space, and both gettys were in an infinite loop. Rebooting the node recovered 70M in disk space.

"Gee," said I, "this looks like a problem with the OS. Perhaps the 800 number could help me out." A call to the 800 number failed to resolve the problem. It seems nobody at Apollo had a problem with runaway gettys. The only possible suggestion was that I'd created the login_log file and that was growing without bounds. Of course, if it was a real file, rebooting the node would not recover that space -- I'd simply find a huge object on the disk I could delete.

Well, if getty doesn't work, perhaps dropping back on the old way -- the SR9 way -- would solve the problem. siomonit came up easily, and did not destroy the cpu or the disk as did getty. However, it did create many processes called "level2" and fill up the process table, prohibiting any use of the node until reboot. So it seems I was stuck with getty.

The next thing I did was to take advantage of the fact that my university is blessed with a Unix source license. I got the source to getty and spent a long time looking at it. I decided that the best way to tackle the problem would be to port the code to SR10.1 and use syslog calls to log significant program events. And this is what showed me the problem. It seems that the serial lines, for as yet unknown reasons, were flooding the line with large numbers of garbage characters. Most of them were ^? (Ascii 0xff) with a few o's (Ascii 0x7f) thrown in for good measure.
What was happening was this: getty accepts only as many characters of the login name as are allowed, and after that, it terminates and passes the string to login even if the user never presses a return key. The Apollo serial drivers terminate the process silently after 255 characters (try holding down a repeating character while logged into an Apollo's serial line), which init faithfully notices and calls up another getty. The time spent in initializing the line, getting the environment, looking it up in gettytab, etc. while bogus characters are being dropped in the line takes up 99% of a DN4000.

The fix I've made to my local copy of getty is the following: If the length of the login name exceeds the maximum number of permissible characters, drop all additional characters on the floor except the erase character, the kill character, and the EOF character. The loop of getty that reads in incoming characters and drops them takes up at most 2% of the DN4000's time.

All told, here are my thoughts regarding this problem:

1. The Unix community (as this is a general Unix bug, and not an Apollo
   specific one) should modify more programs like getty to give debug output
   so these kind of problems can be tracked without having access to the
   source.

2. I've reported the serial line 256th character dropout problem to Apollo,
   and I'm told it will be "fixed in a future software release."

3. The problem with the process table filling up with level2's is a serious
   one, and should be fixed. I suspect this is incomplete process death;
   however, I don't know enough about it to precisely duplicate it. In
   addition, the DN10000 (SR10.1.p) also seems to accumulate level2's.

--------------------------------------------------------------------------------
--
Just spendin' my days,           Leland Ray                       Systems Administrator
Soakin' in them cathode rays.    UIUC - Dept. Civil Engineering   (217) 333-3821
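
The change to getty described in the post above can be sketched roughly as follows. This is only an illustration of the idea, not the actual patch: the function name `read_login_name`, the limit `MAX_NAME_LEN`, and the erase, kill, and EOF character values are all hypothetical (a real getty takes them from gettytab and the tty settings), and treating a newline or carriage return as the end of the name is an assumption added so the sketch has a normal exit.

```c
#include <stddef.h>

/* Hypothetical limit and control characters, chosen for illustration only. */
#define MAX_NAME_LEN 8
#define CH_ERASE 0x08 /* backspace */
#define CH_KILL  0x15 /* ^U */
#define CH_EOF   0x04 /* ^D */

enum name_status { NAME_OK, NAME_EOF, NAME_INCOMPLETE };

/*
 * Run `len` input bytes through the login-name filter.
 * `name` must have room for MAX_NAME_LEN + 1 bytes.
 * Once the name is full, further ordinary characters are dropped
 * instead of ending the read; erase, kill, and EOF are still honoured.
 */
enum name_status read_login_name(const unsigned char *in, size_t len,
                                 char *name)
{
    size_t n = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        unsigned char c = in[i];

        if (c == '\n' || c == '\r') {   /* assumed terminator */
            name[n] = '\0';
            return NAME_OK;
        }
        if (c == CH_EOF) {
            name[n] = '\0';
            return NAME_EOF;
        }
        if (c == CH_ERASE) {            /* remove the last stored character */
            if (n > 0)
                n--;
            continue;
        }
        if (c == CH_KILL) {             /* discard the whole line so far */
            n = 0;
            continue;
        }
        if (n >= MAX_NAME_LEN)          /* name full: drop on the floor */
            continue;
        name[n++] = (char)c;
    }
    name[n] = '\0';
    return NAME_INCOMPLETE;             /* input ran out before a terminator */
}
```

With a filter of this shape, a flood of several hundred garbage bytes leaves the caller still waiting for a terminator rather than handing a bogus name to login, which is the behaviour the post reports as costing about 2% of the CPU instead of 99%.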
jwright@atanasoff.cs.iastate.edu (Jim Wright) (11/12/89)
lray@CIVILGATE.CE.UIUC.EDU (Leland Ray) writes:
| 1. The Unix community (as this is a general Unix bug, and not an Apollo
| specific one) should modify more programs like getty to give debug output so
| these kind of problems can be tracked without having access to the source.

Yeah, let's start by putting a debug mode into sendmail.

--
Jim Wright
jwright@atanasoff.cs.iastate.edu
weber_w@apollo.HP.COM (Walt Weber) (11/15/89)
In article <8911111819.AA02201@civilgate.ce.uiuc.edu> lray@CIVILGATE.CE.UIUC.EDU (Leland Ray) writes:
>
>It all started when I moved the node to SR10.1.
>
>Upon checking the node, I noticed that there was no free disk space, and both
>gettys were in an infinite loop. Rebooting the node recovered 70M in disk space.
>
>It seems that the serial lines, for as yet unknown reasons, were flooding the
> line with large numbers of garbage characters. Most of them were ^?
> (Ascii 0xff) with a few o's (Ascii 0x7f) thrown in for good measure.
>

Leland -

First of all, it looks like you addressed the symptoms of the problem with your mods to getty (like you said, good thing you were "blessed with the source"). It sure looks like the underlying CAUSE is yet to be addressed, however, since you are still getting what appears to be line noise on the serial port.

The behavior is similar to that observed when a modem is attached for dial-out, but is not configured to eliminate result codes (like any Hayes compatible, for example). Thus getty sends a prompt to the modem, the modem sends back either OK or ERROR (which getty interprets as a login user name & exec's login), login prompts for a password and exits on the timeout signal, which respawns getty, and the cycle repeats.

Finding and eliminating the cause of the line noise (0xff & 0x7f are some of the more common "noise indicators") should enable you to use the standard getty without problem. This is in no way meant to indicate that your getty is "inferior", it's just less hassle if you don't have to install local mods on every release, right?

I would also expect that the line is noisy when under use by the PC operator. If she isn't seeing any noise, I would **suspect** that the rs232 cable is being disconnected from the PC and thus acting as an antenna to induce signals on the cable - this is a very common scenario in timesharing mini systems, and can generally be addressed by installing an AB switch at the PC end (I think).

Good luck in killing the cause, once you find it.

...walt...

Walt Weber
Hewlett Packard NARC @ Apollo Systems Division
(508) 256-6600 x8315
People's Republic of Massachusetts
-The views expressed herein are personal, and not binding on ANYONE-