brucee@runx.ips.oz (Bruce Evans) (02/15/88)
This is a report on some of the Minix bugs I have attacked. None of these have been discussed fully in comp.os.minix as far as I know. The most important fixes are for (1) shell does not initialise a signal-catching longjump vector properly. (2) kernel uses uninitialised processor direction flag. (3) kernel does locks without unlock/restore. I have provided fairly complete descriptions for the fixes, rather than diffs, because the actual differences are small and easy to do by hand, and small differences tend to be ambiguous (without context diffs). A reference to The Book is given if possible, otherwise to V1.2+ sources. Some of the bugs are timing-dependent and may only show up on a slow PC like mine. It is possible that the bugs noted but not fixed are only compiler bugs. SYSTEM BUGS ----------- Direction flag -------------- In a recent posting, Bing Bang said that the direction flag was not being cleared properly in kernel/klib88.s, wth cp_mess() being the only (?) thing messed up. In fact the main routines in klib88.s do not initialise it at all (after an interrupt)! So phys_copy(), vid_copy(), resvec(), and scr_up() for the new tty driver are also affected. scr_down() of course needs the opposite direction and sets and restores it correctly. It is amazing that this has caused so few problems. The kernel is vulnerable while any library routine has the direction flag set. I use an assembler version of memcpy() which sets it, but cannot attribute any crashes directly to this bug. The kernel ought to clear the direction flag on every interrupt (the processor ought to but doesn't) and keep it clear except in subroutines. This requires some cooperation from the compiler and library routines. Some compilers also require it. For now, it's simpler and more portable to clear the flag in every routine needing it. Fix: Add lines "cld" to klib88.s in the following places. Your line numbers probably differ. after "cli" in phys_copy() (line 46) later would be clearer but don't put it in a loop before "rep" in cp_mess() (line 168) best place after "cli" in vid_copy() (line 482) before "rep" in scr_up() (line 546??) before "rep" in res_vec() (line 802??) Fixing this is supposed to let us avoid turning interrupts off for so long and get better response. I tried deleting the "cli" in phys_copy() (the one in cp_mess() is usually redundant and the new tty driver does not use vid_copy()). This seems to work but not noticeably change the response time, but needs further testing. Floppy driver ------------- The variable fl_curcyl (part of struct floppy) is never initialised. (It is "cleared" in various places.) Fix: Add the line fp->fl_curcyl = fp->fl_cylinder; just before the successful return from seek() in floppy.c, line 2785 (Book). Effect: This might have wasted a lot of time doing unnecessary seeks, but it appears that the disk controller doesn't need to do any extra work, and in practice the floppy did not go any faster when the change was made. I only found this bug when tracing another one. The busy_map flags set by interrupt() turn out to be rarely set. But the seek command was setting them a lot, probably because the controller realised a seek was unnecessary and responded instantly. Lock/restore ------------ It is difficult to work with the lock/unlock-restore functions to control interrupts. Several better methods have already been mentioned in comp.os.minix. I haven't implemented any of these, but to track down an interrupt-related bug I added some counters to the functions. These detect locks being overwritten. The following problems showed up. All are caused by lock calls without a corresponding unlock. They seem to be fairly harmless, but who knows. (1) kernel/proc.c, unready(), lines 2164 and 2174 (Book). The function has done a lock, but returns without an unlock. Fix: Replace ... return; by ... { restore(); return; } The 2nd return is executed fairly consistently when a shell is killed. (2) kernel/printer.c, do_write(), line 111. A lock is done, but there is no corresponding unlock in the whole driver! I'm not sure how to fix this. It would probably work to put an unlock when the function returns, but there are some busy-wait loops so this is not a good solution. (3) kernel/main.c, main(), line 902 (Book) (first line of C executed). A lock is done without a corresponding unlock. This is certainly harmless, but is redundant because the lock has already been done by a "cli" in fsck.s and again in head.s. It is confusing to have it here, since a system wouldn't normally work if interrupts were on at the start (but the BIOS interrupt handlers are valid here, so it probably doesn't matter). Reading and writing ------------------- Doing zero bytes of i/o to a bad file number should probably be an error. I had a bug which caused this strange activity, but Minix did not complain so the error handler did not get called. Signal system call ------------------ Probably needs to return more complete status. See section on library signal(). Writing to /dev/null -------------------- The recent official patch to fs/read.c breaks writing to /dev/null. It is for rw_chunk(), line 9963 (Book), and involves adding a test && !block_spec to the others on the line. Since /dev/null is block special, the test now fails where it used to succeed. The result is that get_block is called for /dev/null. The driver returns EOF which gets translated into a funny error number (104) and the write fails. Fix: (?) Make /dev/null character special instead of block special. su; cd /dev; rm null; mknod null c 1 3 It certainly doesn't make any sense to cache /dev/null - it probably has been wasting a block. The change makes /dev/null much faster; 30 sec reduced to 20 sec for copying 400K from hard disk to it. Further problem: isatty() now returns TRUE for /dev/null, as it does for all character special devices. COMMANDS BUGS ------------- chmem ----- chmem.c, line 46 if ( (header[0] & 0xFFFF) != MAGIC) This mask is non-portable. header[0] is long, and K & R say (long) 0xFFFF == -1 == 0xFFFFFFFF which is not what is wanted. Modern PC compilers mostly say otherwise. Fix: if ( (header[0] & 0xFFFFL) != MAGIC) There are also some byte-order portability problems. I ran Minix for a while with the wierd long byte order of 2-3-0-1 (I think it came from something justified on a PDP11, but stupid on an 8088). Minix itself ran fine but there were a few problems with the utilities (mkfs is the main problem, fdisk a minor problem, and fsck no problem). grep ---- Doesn't stop at end of file. Cause: Value returned by getc() on line 158 of grep.c is assigned to a char variable. This doesn't matter much if characters are signed (but char 0xff will give a false eof), but my compiler uses unsigned characters (which are more efficient on the 8088). The effect was that EOF was treated as 0x00ff and not 0xffff, a valid char, and grep received an infinitely long line of them. Fix: Replace char *initbuf = buf, c; by char *initbuf; int c; Please declare variables on separate lines. pwd --- pwd.c, line 8. The version I received is missing a semicolon after a structure declaration. This leads to main() being declared as returning a struct direct. pr -- The new version prints garbage in the last few positions when multiple columns are used. Also, memory is probably not freed when it could be. Cause: The carefully defined NIL_PTR is dereferenced on line 264 of pr.c, resulting in garbage null pointers. Fix: Change ... = *(char **) NIL_PTR; to ... = NIL_PTR; To be consistent, the (logical) comparison with 0 on line 347 should also be replaced with an explicit pointer comparsion. sh -- (1) Symptoms: Holding down the DEL key sometimes causes shell to logout, sometimes to hang. The exact behaviour varied as I changed other parts of the system (own clock handler mods, new tty driver, login mods posted by Peter Housel). Cause: SIGINTs generated by the DEL key are not safely trapped in sh1.c. The relevant lines are 224-226. if (talking) signal(SIGINT, onintr); if (setjmp(failpt = m1) || yyparse() || intr) { Here m1 is a local jump buffer and failpt is a global pointer to it. failpt needs to be initialised before the signal(), because it is used by the signal catcher onintr(). Any sigint caught in the huge :-) gap between the signal() and the setjmp() will activate this bug. Fix: Just before these lines, add setjmp(failpt = m1); This setjmp() should always return directly, not through a longjump(), since onintr() is inactive here. (It had better be, because failpt is certain to be garbage. It would be sensible to deactivate failpt explicitly in the signal catcher and make it abort with an "Can't happen" type error message.) Unfinished business: Holding down the DEL key STILL kills the shell in some situations. When the system is overloaded, this is probably an inherent problem with *ix-style signal handling. The shell catches a SIGINT and attempts to signal a catch for the next one, but the kernel services the SIGINT before setting up the catcher. This problem can be consistently duplicated by doing a large cp in the background. Unfortunately, it still happens (much less than without this fix) when the shell is on its own. At least it no longer hangs. There's also something strange about the way this bug is so easy to duplicate, since the window of vulnerability is so small. It looks like the kernel is queueing the signal and dequeueing it when the catcher is installed, but I couldn't see anything like that in the code. Something in the login mods made the bug more apparent. The old login (except for the first) had all signals except those the shell sets ignored. The key SIGINT is not directly affected. Maybe just looking non-ignored signals clogs up the signal processing. The standard Minix longjmp() (which I don't use) backtraces the frame pointers and might have caught this bug. It just halts the processor (but not safely, the first interrupt will cancel the halt). Maybe the bug doesn't show up in standard Minix. (2) Type at the prompt for i in 1 2 <RETURN> <DEL> Then the next attempt at a "for" command gives a sytax error. Logging out cures this of course. Doing a "set" command also works. size ---- size.c, line 49 if ( (head[0] & 0xFFFF) != MAGIC) Same bug as in chmem.c. Fix: if ( (head[0] & 0xFFFFL) != MAGIC) su -- Does not pass on the environment. See also execn(). time ---- Uses its own version of $PATH (:/bin:/usr/bin). Does not pass on the environment. treecmp ------- Doesn't handle file names of length 14 properly. Probably not null terminated. wc -- wc.c, line 130 while((c = getc(stdin)) > 0) { Surely 0 is a valid char? wc doesn't count executables too well with this. The Minix header stops them very early. Fix: while((c = getc(stdin)) >= 0) { LIBRARY ------- execn() ------- Doesn't set up an adequate environment. I forget the details, it may be that argv[0] is null (ps has trouble with the commmand name). Affects init and (much worse) su. signal() -------- Part of the work for the Minix signal() system call is done by the library routine, because the kernel only keeps track of one signal catcher and not one for each signal. The library routine tries to keep track of the individual catchers and the status of uncaught signals (SIG_DFL or SIGN_IGN) in an array. This doesn't work for ignored signals after a fork-exec when the child wants to ignore some signals. The array is in bss and gets 0-initialised to SIG_DFL. The kernel just returns 0 all successful calls. Affect: e.g. dd < bigfile > bigcopy & The dd in the background is aborted by all foreground SIGINTS. Suggested fix: Have the kernel return SIGN_DFL for previously defaulted signals SIG_IGN for previously ignored signals catcher for previously caught signals (easier to check than 0) The library routine would then have enough information to do its job. system() -------- Doesn't check for failure of fork(). I think this is what causes commands with multiple pipes to often hang. Should probably use execlp (when available) instead of execl. Should probably pass on the environment. Bruce Evans (brucee@runx.ips.oz)