scs@adam.mit.edu (Steve Summit) (02/10/91)
In article <1991Feb07.013637.6542@convex.com>, Tom Christiansen writes: >As was mentioned earlier, *every* system call should *always* >be checked, even if you "know" it can't fail. Indeed. However, at the risk of confusing the issue further just when this basic fact is starting to sink in, it's important to remember, when you set out to check for every error everywhere, that the thing to do upon finding one may not be to complain and abort. I'll provide several examples. My own code is usually liberally sprinkled with assertions and sanity checks. One fruitful place to put them is in the default case of a switch, when the default should "never" be reached. Suppose I have a little routine which decodes the value of some internal, numeric flag (presumably so it can be printed out somewhere): #include "status.h" char * status_str(sts) int sts; { switch(sts) { case STS_GOLDEN: return "STS_GOLDEN"; break; case STS_DEMENTED: return "STS_DEMENTED"; break; case STS_ANHYDROUS: return "STS_ANHYDROUS"; break; default: panic("status_str: bad status %d", sts); } } The panic is there, of course, so that next week when I add another status code but forget to update the status_str() routine, my mistake will be very noticeable, and there will be a strong reminder and incentive to fix it. But does it really have to be a fatal error? Probably not, especially if there's any chance that the "impossible case" won't be triggered by me, but by some (hopefully beta-test) user. A better way to handle this particular "impossible case" is nonfatally: default: { static char tmpbuf[15]; oddity("status_str: bad status %d", sts); sprintf(tmpbuf, "status %d", sts); return tmpbuf; } In my code, "oddity" is just like panic (i.e. it prints debugging messages that are to be construed as the developer's problem, not the user's) but it's nonfatal. There's still a noisy message, there's still incentive and reminder to fix the bug, but at least the program can continue to run. There was a case like this in the 4.1bsd tty driver. (I don't know if it's still there.) The driver tries to do CRT erase processing (i.e. the backspace-space-backspace stuff that wipes characters off the screen as you press the delete or rubout or backspace key or whatever) perfectly. The tricky part is that you might need 1, 2, or up to 8 backspaces, depending on whether there is a normal character, a control character (echoed as "^X"), or a tab on the screen being erased. The code for figuring out how many backspaces to emit involved a big array full of action codes. There were a few impossible cases. Sure enough, they were handled with a panic. I don't know about you, but I think that aesthetically pleasing crterase processing, though nice, is a bit of a frill, and probably one of the last things that it's vital for an operating system kernel to do perfectly. I would be very, very annoyed (well, to tell the truth, I'd also be ridiculously amused) if a machine crashed, dropping several users on the floor and losing lots of their work, just because some poor schlep tried to hit the delete key to get rid of some "impossible" character on his command line. (No, it never happened, but once I'd seen that code in the tty driver, I was forever slightly haunted with the fear that the system was going to crash because of it. In fact, hitting the impossible case would have represented a programming error, not an impossible character. Still, the panic seemed needlessly draconian.) Finally, consider write(1). On the recipient's screen, it prints something like message from scs on tty23 at 12:34 ... Now, whatever method it uses to find out what terminal the sender is logged into (probably ttyname(3)), it's not going to work if the sender doesn't have one. How could the sender not have one? Actually, the burden of proof should be on the author of write(1) to show why the program can't proceed if the tty is not known, not on the user to suggest circumstances under which it might be meaningful not to have one, but to satisfy the skeptics out there I'll mention a real-world, non-fabricated case: at jobs. (Actually, this case, or another one involving write(1), came up on the net just a week ago.) I sometimes like to send myself notification that one of my at jobs has finished by including echo "at job is done" | write scs in the at script. However, at scripts run with no controlling tty, so write(1) aborts and refuses to send the message, just because it won't be able to say where it's from. Just now I tried write(1)ing to myself from within emacs, so that I could see the exact "message from..." message on my screen. But since I'm running the Gigantic Neanderthal Ubiquitous emacs, which runs shell escapes in some kind of weird subshell with pipes, all I got was the "write: can't find your tty" message in the *Shell Command Output* window. Some time ago I wrote my own version of write(1), mostly so that I could put in the wrinkle that it leave out the "on ttyxx" part, rather than falling back and punting, if it couldn't determine it. (It also prints to several ttys if the user is logged in multiple times, rather than picking the wrong one. Now, of course, in these security-conscious days, private versions of write(1) don't work well, because they should be setgid.) So yes, check for all error returns. And yes, print noisy messages (unless the error is truly benign, as in the write(1) example). But think for a minute whether the error really has to be fatal, whether there's any reason the program can't continue. If it can't; if the error means that vital data has been corrupted such that continuing would be futile or worse; then by all means abort. But if you can safely continue with little or no effort, please do. Steve Summit scs@adam.mit.edu P.S. I am aware that there are better ways of decoding numeric flags than by using a big switch statement as in the first example. (In fact, I usually use a lookup table.)
duncan@comp.vuw.ac.nz (Duncan McEwan) (02/12/91)
In article <1991Feb07.013637.6542@convex.com>, Tom Christiansen writes: >As was mentioned earlier, *every* system call should *always* >be checked, even if you "know" it can't fail. I'm not sure if this has been discussed in this thread yet, but one possibility that I've seen suggested elsewhere is to have failing system calls generate a signal to the calling process (SIGSYS2 anyone?). The default action might be to terminate the process, though perhaps printing a diagnostic and resuming would be more compatible with current behaviour. I can't offhand think of any particularly nice way of implementing a libc provided default handler that could do the latter without messing with the compilers or crt0.o, etc, though. It would be nice if such a scheme could be implemented entirely without kernel support (perhaps by putting the signal generating stuff in the libc wrappers for each system call), but I don't think that would allow passing a parameter to the handler (which you would want, to give information about what system call failed). Hm, thinking about it, this all sounds a bit VMSish, and if its only aim is to provide protection from programmers who don't check every syscall, then its probably a non-starter :-) Comments, suggestions, flames, anyone? Duncan
hunt@dg-rtp.rtp.dg.com (Greg Hunt) (02/12/91)
In article <1991Feb12.011911.7677@comp.vuw.ac.nz>, duncan@comp.vuw.ac.nz (Duncan McEwan) writes: > > I'm not sure if this has been discussed in this thread yet, but > one possibility that I've seen suggested elsewhere is to have > failing system calls generate a signal to the calling process > (SIGSYS2 anyone?). The default action might be to terminate the > process, though perhaps printing a diagnostic and resuming would > be more compatible with current behaviour. I can't offhand think > of any particularly nice way of implementing a libc provided default > handler that could do the latter without messing with the compilers > or crt0.o, etc, though. > > It would be nice if such a scheme could be implemented entirely > without kernel support (perhaps by putting the signal generating > stuff in the libc wrappers for each system call), but I don't think > that would allow passing a parameter to the handler (which you would > want, to give information about what system call failed). > > Hm, thinking about it, this all sounds a bit VMSish, and if its > only aim is to provide protection from programmers who don't check > every syscall, then its probably a non-starter :-) > > Comments, suggestions, flames, anyone? I'd rather not see it myself. I already have code in which I print out a message when I get SIGSYS (bad argument to system call), and then ignore the signal, thus returning control to the code that made the bad call. It then (supposedly) gets error EINVAL, and since I'm checking the error returns from my system calls, I then print out what system call I was trying to do, what error I got, and any other information I can provide, such as the name of the file I was reading or writing if it was a file system related call. I do similar error reporting for any error from any system call that my code is not prepared to deal with (you can correctly surmise from this which side of the "should I check error returns from system calls" debate I'm on). You mentioned in your second paragraph that you'd want to pass such information somehow so that the signal handler can print it out for you. I think it is easier to handle such errors by checking the return from the system call. It would also allow you to print out a large amount of information that it might be difficult to do with a single parameter to a signal handler. For me, coding error handling where the error occurs certainly makes the code easier to understand. I agree with your last statement. It's a non-starter. That's my opinion, too. -- Greg Hunt Internet: hunt@dg-rtp.rtp.dg.com DG/UX Kernel Development UUCP: {world}!mcnc!rti!dg-rtp!hunt Data General Corporation Research Triangle Park, NC, USA These opinions are mine, not DG's.