roy@phri.UUCP (Roy Smith) (02/10/89)
I sat down on my SunOS-3.5 system to do some rwhod hacking (*) today and ran across something which I consider pretty gross. Like many daemons, rwhod puts itself in the background by forking and having the parent exit. To make it easier (read possible) to debug, if you #define DEBUG, the fork code isn't compiled in and rwhod runs in the foreground. Since I was debugging it, I #defined DEBUG and started single-stepping the process in dbx. Much to my surprise, when it got up to

    sp = getservbyname("who", "udp");

I started getting, every 5 seconds:

    sendto 7f000001.111 hostname up 0:00 load 0.02, 0.17, 0.00

A bit of headscratching revealed what was going on. What rwhod does is to gather up some system statistics once a minute and broadcast the stats using a sendto() call. If you #define DEBUG, rwhod.c includes its own sendto() routine which instead of actually sending out a packet, just prints some stuff on your terminal. It would seem that the author of rwhod never realized that some library routines might also want to call sendto() to do some private stuff and the redefined call would break that. In this case, it was a Yellow Pages based version of getservbyname(). Makes it kind of hard to actually debug the program. I'm surprised Sun never ran across this problem before (I was working from the 3.2 rwhod.c, not having the 3.5 sources available).

The moral of the story is that you shouldn't redefine system calls. If you really need to get some other behavior for (for example) sendto(), you should put a "#define sendto my_sendto" at the beginning of your source file. That way, you get the modified version, but library routines still get the regular one they were expecting.

----------------
(*) Just why was I hacking on rwhod you ask? As documented by Sun, rwho is a real performance pig. With N machines running rwhod on your net, you get N^2 packets received each minute.
With lots of diskless clients, that means your NFS servers spend all their time servicing requests to write /usr/spool/rwho files. The result is that you know what machines are up, but you can't do anything useful on any of them. The N^2 effect isn't so bad when you've got 15 or 20 machines, but it kills you when you've got hundreds.

My idea was to make rwhod write into /usr/lib/rwho instead of /usr/spool/rwho. Each of the diskless clients would run rwhods which sent out status packets but didn't listen for any. Each file server would run a normal rwhod which sent out stats and also listened for status packets and wrote them to /usr/lib/rwho. The diskless clients would NFS mount /usr/lib/rwho, and rwho and ruptime would know to look there instead of /usr/spool/rwho for their data.

The end result is that every machine has a functioning rwho and ruptime, but the load of listening for and writing out the rwho broadcast packets would be reduced by an order of magnitude, not to mention the secondary savings of the servers not having to constantly page rwhod in and out on each of the clients.
-- 
Roy Smith, System Administrator
Public Health Research Institute
{allegra,philabs,cmcl2,rutgers}!phri!roy -or- phri!roy@uunet.uu.net
"The connector is the network"
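
P.S. For anyone who wants to see the "#define sendto my_sendto" idea from the moral above spelled out, here is a rough sketch. It is not taken from rwhod.c: my_sendto() and report_status() are names made up for illustration, and the prototypes are the modern POSIX ones rather than what SunOS 3.x had. One assumption worth flagging: with prototyped system headers, the #define probably has to come after the #includes rather than at the very top of the file, so that the header's own declaration of sendto() is left alone.

```c
/* Sketch only. my_sendto() and report_status() are hypothetical names. */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Normally you'd build with -DDEBUG; defined here so the sketch stands alone. */
#ifndef DEBUG
#define DEBUG 1
#endif

#ifdef DEBUG
/*
 * Debug stand-in: print what would have gone out instead of sending it.
 * Because it has its own name (and is static), the linker still resolves
 * sendto() calls made inside library routines to the real system call.
 */
static ssize_t my_sendto(int s, const void *buf, size_t len, int flags,
                         const struct sockaddr *to, socklen_t tolen)
{
    (void)s; (void)flags; (void)to; (void)tolen;
    assert(buf != NULL);
    printf("sendto: would send %lu bytes\n", (unsigned long)len);
    return (ssize_t)len;    /* pretend the whole buffer went out */
}

/* From this point on, sendto() in THIS file only means my_sendto(). */
#define sendto my_sendto
#endif

/* Stands in for the daemon's own code; it keeps calling "sendto". */
ssize_t report_status(int s, const char *msg,
                      const struct sockaddr *to, socklen_t tolen)
{
    return sendto(s, msg, strlen(msg), 0, to, tolen);
}
```

With DEBUG defined, a call like report_status(-1, "hello", NULL, 0) just prints a line and returns the message length without touching the network. Without DEBUG, the same source line goes straight to the real sendto(). Either way, getservbyname() and friends never see the stand-in.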