brucee@runx.ips.oz (Bruce Evans) (07/08/88)
I found the following bugs while beating hard on the system recently, doing Minix development on Minix for the first time and installing the 1.3 diffs. The system held up very well and my only major dis-satisfaction is the slowness of fs+drivers. Most of the bugs are still in 1.3. All diffs are relative to 1.3. lib === sbrk() doesn't check underflow ------------------------------ Sbrk() in brk.c still doesn't do the bounds checking right. 1.2 didn't check anything. 1.3 should check underflow as well as overflow. Sbrk makes the undocumented assuption that the initial brksize is even. It also declares unused undefined externs endv and dorgv. brk.c.diff: ----------- 24d23 < extern int endv, dorgv; 28c27,28 < if (incr > 0 && newsize < oldsize) return( (char *) -1); --- > if (incr > 0 && newsize < oldsize || incr < 0 && newsize > oldsize) > return( (char *) -1); ----------- ctime fails on Dec 31 of leap years ------------------------------------- This bug has already been reported for the date command. My fix is simpler. In ctime.c, too much is subtracted from the 't' variable, allowing it to go negative. I checked the fixed version against another ctime implementation. They agreed every hour 1970 to 1999. Minix ctime still fails miserably before 1970 and more reasonably after 2000. This is for the upgraded ctime posted just after the 1.3 diffs. ctime.c.diff ------------ 47a48,50 > { > if ( t < YEAR + DAY ) > break; 48a52 > } ------------ fopen.c needs to include <errno.h> ---------------------------------- to get at errno and ENOENT. commands ======== asld divides by zero -------------------- ZERO = /0 causes this. (Believe it or not, "/" means "0x" for at least one Xenix assembler.) echo discards quotes and fails with lots of arguments ----------------------------------------------------- The buffering method in echo fails on arguments falling across the buffer boundary. Since the buffer size is 2048 and exec used to provide at most 1024 bytes, no one noticed. With the new exec size of 2048 it should show up, but I haven't re-compiled enough of 1.3 to check. fdisk incompatible with DOS 3.3 fdisk ------------------------------------- Partitions built with fdisk cause DOS 3.3 fdisk to crash, because of an improper value in the incompletely documented sysind field. More advice needs to be given on setting up Minix partitions. I have done it many times, but still have trouble. DOS 3.3 fdisk at least _formats_ the partitions it sets up, so you can't modify an existing partition without losing its data. I found this out the hard way after losing the Minix partition table entry when DOS fdisk thought it was linked to the DOS entry (it was, because although I used Minix fdisk to set up the Minix partition, I changed the sysind field from 0 to the (linking) value generated by DOS fdisk to stop DOS fdisk from crashing). Another important thing is to reboot after changing the partition table. If you think that the new values are effective immediately, it's easy to mkfs the wrong partition. getc() assigned to character variable ------------------------------------- The following commands assign the value returned by getc() or getchar() to a character variable which may be tested against EOF later. This fails when characters are unsigned, and gives a premature EOF when the file contains (char) EOF anyway. A couple of these test the getc() value is <= 0, but only == EOF and < 0 are correct. A global search produced: ast.c readaline() df.c getname() fdisk.c get_a_char() mkfs.c getline() prep.c main(), skipline(), backslash() uniq.c getline() gets() buffer overflow possible ------------------------------- Several commands, e.g. date and fdisk, use gets when fgets() would be better because it doesn't allow the buffer to overflow. head.s needs to define environ ------------------------------- Init calls execn() which is from exec.c and something there references environ. Normally environ is defined in crtso.s but head.s is linked instead for mm/fs/init. I think exec.c and the other library files which declare lots of functions should be split up. This will make the source distribution larger (not much if it is archived) and the binaries a little smaller. Another problem with init and the environment is that the 1.2 version used the 1.2 execn() which doesn't set up the environment. This didn't hurt me in 1.2 (why?) but in 1.3 the sh working on /etc/rc ran out of string space trying to set up a junk environment. mkfs allows file systems larger than device ------------------------------------------- Mkfs.c attempts to test for file systems too large for the device by writing to the block after the last, reading it back, and comparing. This doesn't work because the i/o goes into the cache and always succeeds. (Eventually the cache is flushed and an error is generated. The error handling for devices with a size is confusing: the error is EOF which is ignored.) Thus errors from the RAM disk are completely ignored. I fixed this by doing a sync() between the write() and read(). mv messes up setuid bit and aliases ----------------------------------- Doesn't preserve setuid bit when it uses cp. mkdir a; cd a; > a; mv a ../a silently deletes file a. mkdir a; cd a; > a; mv a .. silently deletes file a (directory a seems OK). rm(?) deletes "." and ".." -------------------------- Several times I somehow deleted the "." and ".." entries in a directory, using a command like rm -r ../anotherdir. The file system remained valid (fsck should really find this error), and the rest of the files in the directory were recovered by going up a level and doing mv `ls baddir` gooddir. sh quoting bugs --------------- a=`echo b c` reduces to a=b c, so assigns a=b and executes c Xenix sh assigns the entire output of a graved command (independent of quotes in the output). a="b c"; d=$a evaluates as d=b c and fails a='"b c"'; d=$a evaluates as d=b 'c"' a='"b c"'; d="$a" does the same a=a; b=' '; c=c; d=$a$b$c evaluates as d=a c a='"a"'; b='"b"', then c=$a$b works c="$a$b" eats the first quote off $a I couldn't find a general way to make the shell concatenate 2 variables!!!! tty === When 2 processes try to do tty i/o at once, the 2nd fails in bad ways. The worst is with multiple writes. This can only happen after the 1st writer is suspended. The 1st writer's state is overwritten, in particular the process number on line 3917 of the book. Without the process number, un-suspension is impossible. Killing the stuck process used to kill the system but that seems to have stopped. To see this bug, start several processes doing main() { while(1) printf("*"), sleep(2); } and hit ^S^Q. Fixing this properly would be best done by holding the state in fs. The 2nd writer (or reader) should not get an error either. Fs can handle this by a suspension which records the i/o request in the proposed state entry. A FIFO queue can be used to select outstanding i/o requests when the device becomes free. I made a related change a few months ago to allow fs to parcel out humungous write requests in pieces acceptable to the driver. The driver just returns a bytes_written count < bytes_to_write for automatic suspension. Very few programs are written to handle incomplete writes like this, and they shouldn't have to. Perhaps large writes to pipes can be broken up in similarly. When a 2nd process tries to read from tty, the error code E_TRY_AGAIN is returned (line 3793 of the book). I think this error number filters back to the caller, though it is supposed to be for the kernel's internal use. Anyway the read fails and few programs can handle the errno. A simple example is set | more Output from set is nonstandard. More grabs the tty for reading and sh input fails so sh logs out. zero taken as EOF ----------------- This has been fixed in a few commands (e.g. wc), but it's still in "more", which confuses 0 from the buffer with the value returned by input() for EOF. non-bugs ======== elle wishlist ------------- Editing a linked file shouldn't destroy the link. (The .bak file stays linked.) Shelling out should preserve the environment. Backup files shouldn't be created automatically (hate cleaning them up). Quitting shouldn't require confirmation. mount wishlist -------------- If the floppies are writable by everybody, it's too easy to write on a mounted floppy. Mount should change the permissions to stop this. Just like login (should) change tty permissions to protect logins from being read by others. writing to stdin, reading from stdout and stderr should be allowed (?) ---------------------------------------------------------------------- Some Unix programs actually need this, e.g. compress and patch. Xenix certainly allows both reading and writing to file descriptors 0, 1 and 2, provided they have not been redirected by the shell. I think the reason is to allow stderr at least to serve as a channel for handling input in response to error output, after stdin has been redirected. This can be arranged by changing all the open modes for ttys in init.c to 2. I have been running this for 3 months with no problems. zombie externs and conflicting use of extern library names ---------------------------------------------------------- Some commands and library routines declare unused variables as extern, and the variables are not even defined externally. It's dangerous to redefine a function in the library. If the new definition is incompatible, everyone is confused and library routines may call the wrong function (or non-function - some linkers don't complain even if printf is declared as an int). If it's supposed to be compatible, time is wasted and there's more chance of bugs. (When ctime is fixed, we can rip out lots of buggy date routines.) My linker is fussy and found these. df.c: declares the unused undefined extern super_block via super.h ed.c: defines its own incompatible version of setbuf() ed.c: defines its own compatible(?) version of system() make.c: defines its own compatible(?) version of getenv() mknod.c: defines its own compatible(?) version of atoi() readfs.c: declares numerous unused undefined externs via buf.h and super.h shar.c: defines index as a global int shar.c: defines its own incompatible version of putchar() sort.c: defines a global function incr() which is global int in termcap.c. This is termcap's fault. termcap.c: declares unused undefined externs ospeed, PC, BC, UP touch.c:defines its own compatible(?) version of std_err() uudecode:defines its own compatible(?) version of index() Bruce Evans Internet: brucee@runx.ips.oz.au UUCP: uunet!runx.ips.oz.au!brucee
brucee@runx.ips.oz (Bruce Evans) (07/13/88)
I found some more bugs. These bit harder so I found all reasons and some fixes. New tty ======= (1) rs_struct[1].rs_base in tty.c init_rs232() is initialised when there is no rs_struct[1]. My vid_base was was destroyed :-). Make this conditional: #if NR_RS_LINES > 1 rs_struct[1].rs_base = SECONDARY; #endif (2) SMALL_STACK in table.c is too small for TTY. The problem shows up mainly mainly with the F2 dump. Printk uses a lot of stack. My proc_ptr was destroyed first :-). Increase the size from 256 to 512: #define TTY_STACK 512 /* SMALL_STACK is just too small */ (3) base in tty.c config_rs232() uses the wrong rs232-line number. The rs_struct's are not offset by NR_CONS. All this bug does is stop rs232 working. base = rs_struct[line].rs_base; (4) The keyboard initialization is probably not solid. I use a boot program in which the equivalent of the '=' key can be hit after loading everything from the disk, and the system can lock up. I think there is no time for the BIOS to handle the key release, and the keyboard handler doesn't see it because I reprogrammed the interrupt controller (to undo the misplacement of the device vectors). My problem was easy to fix by strobing the keyboard during console initialization. Perhaps there are other devices with incomplete initializations depending on undocumented states left by the BIOS, (the screen of course)? In any case, it is WRONG to enable interrupts before the devices have been initialised (in main.c). I also don't like the centralization of setting up device vectors. After fixing all these and merging the local changes which interface to my debugger (switching to a stand-alone set of i/o routines requires support from the O/S since too many hardware registers are write-only), the new tty worked, sort of. I took out the debugging statement which echos all rs232 output to the screen. It worked at as a dumb terminal at 9600 baud on my 386 system. Zmodem output worked, zmodem input didn't. I thought the problem might be with a too cavalier approach to discarding i/o and tried increasing the input buffer size without success. Since it's clear that the new tty is not ready, I'll go back to my modified version of the Paradis driver, which also sort of works but much better. I still don't think acceptable throughput (== 19200 baud on a 5MHz 8088 with no load) can be attained without "fixing" the kernel. The problem is that interrupts are disabled for several millisec. I get about 4800 baud with lots of kernel changes. (Old) mkfs ========== My last report complained about getline() in mkfs.c not checking for EOF properly (mainly hurts when the compiler uses unsigned characters, but wrong anyway). More seriously, it doesn't check for buffer overflow and goes crazy when the EOF test fails. I have a line "mkfs /dev/ram 384" in /etc/rc. After I somehow created an empty file /384, mkfs crashed. I just deleted /384, but the proper method is to avoid truncating the value returned from getc() by storing it in a char variable. Divide by zero trap and signals =============================== Division by zero now gives _repeated_ traps before the divider finally dies. The V1.3 main.c does this after divide by zero: cause_sig(cur_proc, SIGILL); /* send signal to current process */ unready(proc_addr(cur_proc)); /* deschedule current process */ This probably worked in 1.2, but the V1.3 cause_sig() now calls unready resulting in a pick_proc() so cur_proc usually changes. My fix was just to save the original cur_proc for unready()ing. I'm still worried about this. First, if cur_proc is a task or server (can't happen :-)) there should be a panic. Is it safe to go straight to cause_sig() without messaging the system task? Perhaps the PUBLIC routines in system.c belong in proc.c to show such safety? The failure mode for this bug was instructive. Assume no hardware interrupts to confuse us except clock ticks. Assume we just omit the unready() in the trap, which is effectively what the bug does since unreadying the idle process seems to be harmless. (1) cause_sig() calls unready(cur_proc), unready() calls pick_proc(). It may be better for unready() not to call pick_proc(), to avoid situations like this. Note that ready() does not call pick_proc(). Are there situations where a newly readied process languishes on the ready queue while the idle process runs till the next interrupt? (2) cause_sig() calls inform() which calls mini_send() which readies MM. Then it readies the process to be signalled. This seems wrong(???), since the signal should be the next thing the process does. I tried removing the ready() but since the other signal code apparently doesn't ready the process to be signaled, the signal was long delayed (till the next update alarm on my system, perhaps since I have changed the low-level clock code to call the clock task less). (3) Now MM and the offending process are ready but IDLE is running. There is an unnecessary delay till the next clock tick. There should probably be a call to pick_proc() just after the mini_send() in inform(). (4) The clock task runs at the next clock tick. When it is unreadied, MM is readied at last. MM needs to do a core dump for SIGILL, so lots of other system activity occurs. (Another fishy piece of code is the pick_proc() in sched() called by the clock task (not now). Since the clock task is running, it is at the head of the task queue and will be picked again. The new user is only picked after the clock task blocks.) (5) When the system waits for i/o, it sees the process to be signalled is ready, runs it, and incurs another trap! It is saved from recursion because the 1st signal eventually completes and kills the process. I always got exactly 2 traps. Kernel error numbers and bad message addresses ============================================== In the profiling support routine monstartup(), sbrk() may be inadvertently called with a negative argument. (Unlike malloc(), sbrk() can reduce the break.) I won't fix profil here since it is non-standard, but look at the effect: the break may get reduced below the global message address, after which _all_ system calls, including sbrk() to fix the problem, and exit() to give up, fail! The simplest fix is to SIGKILL any process which gives a bad message address. This is also an example of the problem with ambiguous error numbers. The kernel returns E_BAD_ADDR which is the same as an unrelated user error number. The kernel should never return internal error numbers, so these must be made unambiguous and turned into user numbers, or converted before return. Bruce Evans Internet: brucee@runx.ips.oz.au UUCP: uunet!runx.ips.oz.au!brucee
Ralf.Brown@B.GP.CS.CMU.EDU (07/19/88)
In article <1648@runx.ips.oz>, brucee@runx.ips.oz (Bruce Evans) writes: }much better. I still don't think acceptable throughput (== 19200 baud }on a 5MHz 8088 with no load) can be attained without "fixing" the }kernel. The problem is that interrupts are disabled for several }millisec. I get about 4800 baud with lots of kernel changes. I would suggest adding support for the 16550 UART in tty.c. This chip, which is pin-compatible with the 8250 and 16450, has a 16-byte on-chip receive buffer, so an incoming 9600-baud data stream need only be read 60 times per second to prevent data loss. -- UUCP: {ucbvax,harvard}!cs.cmu.edu!ralf -=-=-=- Voice: (412) 268-3053 (school) ARPA: ralf@cs.cmu.edu BIT: ralf%cs.cmu.edu@CMUCCVMA FIDO: Ralf Brown 1:129/31 Disclaimer? I |Ducharm's Axiom: If you view your problem closely enough claimed something?| you will recognize yourself as part of the problem.
bdale@hpcsrc.HP.COM (Bdale Garbee) (08/03/88)
>I would suggest adding support for the 16550 UART in tty.c. This chip, >which is pin-compatible with the 8250 and 16450, has a 16-byte on-chip >receive buffer, so an incoming 9600-baud data stream need only be read >60 times per second to prevent data loss. Watch out. This is indeed a wonderful chip, but it's not *quite* pin compatible... read the 16550 data sheet. It should be trivial to get one working with Minix in a PC clone... I may try it myself eventually. Bdale