[comp.os.minix] Bugs found installing V1.3

brucee@runx.ips.oz (Bruce Evans) (07/08/88)

I found the following bugs while beating hard on the system recently, doing
Minix development on Minix for the first time and installing the 1.3 diffs.
The system held up very well and my only major dis-satisfaction is the
slowness of fs+drivers.

Most of the bugs are still in 1.3. All diffs are relative to 1.3.

lib
===

sbrk() doesn't check underflow
------------------------------
Sbrk() in brk.c still doesn't do the bounds checking right. 1.2 didn't check
anything. 1.3 should check underflow as well as overflow.
Sbrk makes the undocumented assuption that the initial brksize is even.
It also declares unused undefined externs endv and dorgv. 

brk.c.diff:
-----------
24d23
<   extern int endv, dorgv;
28c27,28
<   if (incr > 0 && newsize < oldsize) return( (char *) -1);
---
>   if (incr > 0 && newsize < oldsize || incr < 0 && newsize > oldsize)
>   	return( (char *) -1);
-----------

ctime fails on Dec 31 of leap years
-------------------------------------
This bug has already been reported for the date command. My fix is simpler.
In ctime.c, too much is subtracted from the 't' variable, allowing it to go
negative. I checked the fixed version against another ctime implementation.
They agreed every hour 1970 to 1999. Minix ctime still fails miserably
before 1970 and more reasonably after 2000. This is for the upgraded ctime
posted just after the 1.3 diffs.

ctime.c.diff
------------
47a48,50
> 		{
> 			if ( t < YEAR + DAY )
> 				break;
48a52
> 		}
------------

fopen.c needs to include <errno.h>
----------------------------------
to get at errno and ENOENT.

commands
========

asld divides by zero
--------------------
ZERO	=	/0
causes this. (Believe it or not, "/" means "0x" for at least one Xenix
assembler.)

echo discards quotes and fails with lots of arguments
-----------------------------------------------------
The buffering method in echo fails on arguments falling across the buffer
boundary. Since the buffer size is 2048 and exec used to provide at most
1024 bytes, no one noticed. With the new exec size of 2048 it should show
up, but I haven't re-compiled enough of 1.3 to check.

fdisk incompatible with DOS 3.3 fdisk
-------------------------------------
Partitions built with fdisk cause DOS 3.3 fdisk to crash, because of an
improper value in the incompletely documented sysind field.

More advice needs to be given on setting up Minix partitions. I have done
it many times, but still have trouble. DOS 3.3 fdisk at least _formats_
the partitions it sets up, so you can't modify an existing partition
without losing its data. I found this out the hard way after losing the
Minix partition table entry when DOS fdisk thought it was linked to the
DOS entry (it was, because although I used Minix fdisk to set up the
Minix partition, I changed the sysind field from 0 to the (linking) value
generated by DOS fdisk to stop DOS fdisk from crashing). Another important
thing is to reboot after changing the partition table. If you think that
the new values are effective immediately, it's easy to mkfs the wrong
partition.

getc() assigned to character variable
-------------------------------------
The following commands assign the value returned by getc() or getchar() to
a character variable which may be tested against EOF later. This fails when
characters are unsigned, and gives a premature EOF when the file contains
(char) EOF anyway. A couple of these test the getc() value is <= 0, but 
only == EOF and < 0 are correct. A global search produced:
 
	ast.c	readaline()
	df.c	getname()
	fdisk.c	get_a_char()
	mkfs.c	getline()
	prep.c	main(), skipline(), backslash()
	uniq.c	getline()

gets() buffer overflow possible
-------------------------------
Several commands, e.g. date and fdisk, use gets when fgets() would be better
because it doesn't allow the buffer to overflow.

head.s needs to define environ
-------------------------------
Init calls execn() which is from exec.c and something there references
environ. Normally environ is defined in crtso.s but head.s is linked instead
for mm/fs/init.

I think exec.c and the other library files which declare lots of functions
should be split up. This will make the source distribution larger (not much
if it is archived) and the binaries a little smaller.

Another problem with init and the environment is that the 1.2 version used
the 1.2 execn() which doesn't set up the environment. This didn't hurt me in
1.2 (why?) but in 1.3 the sh working on /etc/rc ran out of string space
trying to set up a junk environment.

mkfs allows file systems larger than device
-------------------------------------------
Mkfs.c attempts to test for file systems too large for the device by writing
to the block after the last, reading it back, and comparing. This doesn't
work because the i/o goes into the cache and always succeeds. (Eventually the
cache is flushed and an error is generated. The error handling for devices
with a size is confusing: the error is EOF which is ignored.) Thus errors from
the RAM disk are completely ignored. I fixed this by doing a sync() between
the write() and read().

mv messes up setuid bit and aliases
-----------------------------------
Doesn't preserve setuid bit when it uses cp.
mkdir a; cd a; > a; mv a ../a silently deletes file a.
mkdir a; cd a; > a; mv a .. silently deletes file a (directory a seems OK).

rm(?) deletes "." and ".."
--------------------------
Several times I somehow deleted the "." and ".." entries in a directory, using
a command like rm -r ../anotherdir. The file system remained valid (fsck
should really find this error), and the rest of the files in the directory
were recovered by going up a level and doing mv `ls baddir` gooddir.

sh quoting bugs
---------------
a=`echo b c` reduces to a=b c, so assigns a=b and executes c
Xenix sh assigns the entire output of a graved command (independent of quotes
in the output).
a="b c"; d=$a evaluates as d=b c and fails
a='"b c"'; d=$a evaluates as d=b 'c"'
a='"b c"'; d="$a" does the same
a=a; b=' '; c=c; d=$a$b$c evaluates as d=a c
a='"a"'; b='"b"', then
	c=$a$b works
	c="$a$b" eats the first quote off $a
I couldn't find a general way to make the shell concatenate 2 variables!!!!

tty
===

When 2 processes try to do tty i/o at once, the 2nd fails in bad ways. The
worst is with multiple writes. This can only happen after the 1st writer
is suspended. The 1st writer's state is overwritten, in particular the
process number on line 3917 of the book. Without the process number,
un-suspension is impossible. Killing the stuck process used to kill the
system but that seems to have stopped. To see this bug, start several
processes doing
   main() { while(1) printf("*"), sleep(2); }
and hit ^S^Q.

Fixing this properly would be best done by holding the state in fs. The 2nd
writer (or reader) should not get an error either. Fs can handle this by a
suspension which records the i/o request in the proposed state entry. A FIFO
queue can be used to select outstanding i/o requests when the device becomes
free. I made a related change a few months ago to allow fs to parcel out
humungous write requests in pieces acceptable to the driver. The driver
just returns a bytes_written count < bytes_to_write for automatic suspension.
Very few programs are written to handle incomplete writes like this, and
they shouldn't have to. Perhaps large writes to pipes can be broken up in
similarly.

When a 2nd process tries to read from tty, the error code E_TRY_AGAIN is
returned (line 3793 of the book). I think this error number filters back
to the caller, though it is supposed to be for the kernel's internal use.
Anyway the read fails and few programs can handle the errno. A simple
example is
   set | more
Output from set is nonstandard. More grabs the tty for reading and sh input
fails so sh logs out.

zero taken as EOF
-----------------
This has been fixed in a few commands (e.g. wc), but it's still in "more",
which confuses 0 from the buffer with the value returned by input() for EOF.

non-bugs
========

elle wishlist
-------------
Editing a linked file shouldn't destroy the link. (The .bak file stays linked.)
Shelling out should preserve the environment.
Backup files shouldn't be created automatically (hate cleaning them up).
Quitting shouldn't require confirmation.

mount wishlist
--------------
If the floppies are writable by everybody, it's too easy to write on a mounted
floppy. Mount should change the permissions to stop this. Just like login
(should) change tty permissions to protect logins from being read by others.

writing to stdin, reading from stdout and stderr should be allowed (?)
----------------------------------------------------------------------
Some Unix programs actually need this, e.g. compress and patch. Xenix certainly
allows both reading and writing to file descriptors 0, 1 and 2, provided they
have not been redirected by the shell. I think the reason is to allow stderr
at least to serve as a channel for handling input in response to error output,
after stdin has been redirected.

This can be arranged by changing all the open modes for ttys in init.c to 2.
I have been running this for 3 months with no problems.

zombie externs and conflicting use of extern library names
----------------------------------------------------------
Some commands and library routines declare unused variables as extern, and
the variables are not even defined externally.

It's dangerous to redefine a function in the library. If the new definition
is incompatible, everyone is confused and library routines may call the
wrong function (or non-function - some linkers don't complain even if printf
is declared as an int). If it's supposed to be compatible, time is wasted
and there's more chance of bugs. (When ctime is fixed, we can rip out lots
of buggy date routines.)

My linker is fussy and found these.

df.c:	declares the unused undefined extern super_block via super.h 
ed.c:	defines its own incompatible version of setbuf()
ed.c:	defines its own compatible(?) version of system()
make.c:	defines its own compatible(?) version of getenv()
mknod.c: defines its own compatible(?) version of atoi()
readfs.c: declares numerous unused undefined externs via buf.h and super.h 
shar.c:	defines index as a global int
shar.c:	defines its own incompatible version of putchar()
sort.c:	defines a global function incr() which is global int in termcap.c.
	This is termcap's fault.
termcap.c: declares unused undefined externs ospeed, PC, BC, UP
touch.c:defines its own compatible(?) version of std_err()
uudecode:defines its own compatible(?) version of index()

Bruce Evans
Internet: brucee@runx.ips.oz.au    UUCP: uunet!runx.ips.oz.au!brucee

brucee@runx.ips.oz (Bruce Evans) (07/13/88)

I found some more bugs. These bit harder so I found all reasons and some
fixes.

New tty
=======

(1) rs_struct[1].rs_base in tty.c init_rs232() is initialised when there is
no rs_struct[1]. My vid_base was was destroyed :-). Make this conditional:

#if NR_RS_LINES > 1
  rs_struct[1].rs_base = SECONDARY;
#endif

(2) SMALL_STACK in table.c is too small for TTY. The problem shows up mainly
mainly with the F2 dump. Printk uses a lot of stack. My proc_ptr was
destroyed first :-). Increase the size from 256 to 512:

#define	TTY_STACK	512		/* SMALL_STACK is just too small */

(3) base in tty.c config_rs232() uses the wrong rs232-line number. The
rs_struct's are not offset by NR_CONS. All this bug does is stop rs232 working.

base = rs_struct[line].rs_base;

(4) The keyboard initialization is probably not solid. I use a boot program
in which the equivalent of the '=' key can be hit after loading everything
from the disk, and the system can lock up. I think there is no time for
the BIOS to handle the key release, and the keyboard handler doesn't
see it because I reprogrammed the interrupt controller (to undo the
misplacement of the device vectors).  My problem was easy to fix by
strobing the keyboard during console initialization.

Perhaps there are other devices with incomplete initializations
depending on undocumented states left by the BIOS, (the screen of
course)? In any case, it is WRONG to enable interrupts before the
devices have been initialised (in main.c). I also don't like the
centralization of setting up device vectors.

After fixing all these and merging the local changes which interface to
my debugger (switching to a stand-alone set of i/o routines requires
support from the O/S since too many hardware registers are write-only),
the new tty worked, sort of. I took out the debugging statement which
echos all rs232 output to the screen. It worked at as a dumb terminal
at 9600 baud on my 386 system. Zmodem output worked, zmodem input
didn't. I thought the problem might be with a too cavalier approach to
discarding i/o and tried increasing the input buffer size without
success.

Since it's clear that the new tty is not ready, I'll go back to my
modified version of the Paradis driver, which also sort of works but
much better. I still don't think acceptable throughput (== 19200 baud
on a 5MHz 8088 with no load) can be attained without "fixing" the
kernel. The problem is that interrupts are disabled for several
millisec. I get about 4800 baud with lots of kernel changes.

(Old) mkfs
==========

My last report complained about getline() in mkfs.c not checking for
EOF properly (mainly hurts when the compiler uses unsigned characters,
but wrong anyway).  More seriously, it doesn't check for buffer
overflow and goes crazy when the EOF test fails. I have a line
"mkfs /dev/ram 384" in /etc/rc. After I somehow created an empty file
/384, mkfs crashed. I just deleted /384, but the proper method is to
avoid truncating the value returned from getc() by storing it in a
char variable.

Divide by zero trap and signals
===============================

Division by zero now gives _repeated_ traps before the divider finally
dies.

The V1.3 main.c does this after divide by zero:

cause_sig(cur_proc, SIGILL);    /* send signal to current process */
unready(proc_addr(cur_proc));   /* deschedule current process */

This probably worked in 1.2, but the V1.3 cause_sig() now calls unready
resulting in a pick_proc() so cur_proc usually changes. My fix was just
to save the original cur_proc for unready()ing.

I'm still worried about this. First, if cur_proc is a task or server
(can't happen :-)) there should be a panic. Is it safe to go straight
to cause_sig() without messaging the system task? Perhaps the PUBLIC
routines in system.c belong in proc.c to show such safety?

The failure mode for this bug was instructive. Assume no hardware
interrupts to confuse us except clock ticks. Assume we just omit the
unready() in the trap, which is effectively what the bug does since
unreadying the idle process seems to be harmless.

(1) cause_sig() calls unready(cur_proc), unready() calls pick_proc().
It may be better for unready() not to call pick_proc(), to avoid
situations like this. Note that ready() does not call pick_proc().  Are
there situations where a newly readied process languishes on the ready
queue while the idle process runs till the next interrupt?

(2) cause_sig() calls inform() which calls mini_send() which readies
MM. Then it readies the process to be signalled. This seems wrong(???),
since the signal should be the next thing the process does. I tried
removing the ready() but since the other signal code apparently doesn't
ready the process to be signaled, the signal was long delayed (till the
next update alarm on my system, perhaps since I have changed the
low-level clock code to call the clock task less).

(3) Now MM and the offending process are ready but IDLE is running.
There is an unnecessary delay till the next clock tick. There should
probably be a call to pick_proc() just after the mini_send() in
inform().

(4) The clock task runs at the next clock tick. When it is unreadied,
MM is readied at last.  MM needs to do a core dump for SIGILL, so lots
of other system activity occurs.  (Another fishy piece of code is the
pick_proc() in sched() called by the clock task (not now). Since the
clock task is running, it is at the head of the task queue and will be
picked again.  The new user is only picked after the clock task
blocks.)

(5) When the system waits for i/o, it sees the process to be signalled
is ready, runs it, and incurs another trap! It is saved from recursion
because the 1st signal eventually completes and kills the process. I
always got exactly 2 traps.

Kernel error numbers and bad message addresses
==============================================

In the profiling support routine monstartup(), sbrk() may be
inadvertently called with a negative argument. (Unlike malloc(), sbrk()
can reduce the break.) I won't fix profil here since it is non-standard,
but look at the effect: the break may get reduced below the global
message address, after which _all_ system calls, including sbrk() to
fix the problem, and exit() to give up, fail!

The simplest fix is to SIGKILL any process which gives a bad message
address.

This is also an example of the problem with ambiguous error numbers.
The kernel returns E_BAD_ADDR which is the same as an unrelated user
error number. The kernel should never return internal error numbers,
so these must be made unambiguous and turned into user numbers, or
converted before return.

Bruce Evans
Internet: brucee@runx.ips.oz.au    UUCP: uunet!runx.ips.oz.au!brucee

Ralf.Brown@B.GP.CS.CMU.EDU (07/19/88)

In article <1648@runx.ips.oz>, brucee@runx.ips.oz (Bruce Evans) writes:
}much better. I still don't think acceptable throughput (== 19200 baud
}on a 5MHz 8088 with no load) can be attained without "fixing" the
}kernel. The problem is that interrupts are disabled for several
}millisec. I get about 4800 baud with lots of kernel changes.

I would suggest adding support for the 16550 UART in tty.c.  This chip,
which is pin-compatible with the 8250 and 16450, has a 16-byte on-chip
receive buffer, so an incoming 9600-baud data stream need only be read
60 times per second to prevent data loss.

--
UUCP: {ucbvax,harvard}!cs.cmu.edu!ralf -=-=-=- Voice: (412) 268-3053 (school)
ARPA: ralf@cs.cmu.edu  BIT: ralf%cs.cmu.edu@CMUCCVMA  FIDO: Ralf Brown 1:129/31
Disclaimer? I     |Ducharm's Axiom:  If you view your problem closely enough
claimed something?|   you will recognize yourself as part of the problem.

bdale@hpcsrc.HP.COM (Bdale Garbee) (08/03/88)

>I would suggest adding support for the 16550 UART in tty.c.  This chip,
>which is pin-compatible with the 8250 and 16450, has a 16-byte on-chip
>receive buffer, so an incoming 9600-baud data stream need only be read
>60 times per second to prevent data loss.

Watch out.  This is indeed a wonderful chip, but it's not *quite* pin
compatible... read the 16550 data sheet.  It should be trivial to get one
working with Minix in a PC clone... I may try it myself eventually.

Bdale