[comp.os.minix] Miscellaneous bug fixes

brucee@runx.ips.oz (Bruce Evans) (02/15/88)
    This is a report on some of the Minix bugs I have attacked.  None
    of these have been discussed fully in comp.os.minix as far as I know.

    The most important fixes are for
(1)
    shell does not initialise a signal-catching longjump vector properly.
(2)
    kernel uses uninitialised processor direction flag.
(3)
    kernel does locks without unlock/restore.

    I have provided fairly complete descriptions for the fixes, rather
    than diffs, because the actual differences are small and easy to do
    by hand, and small differences tend to be ambiguous (without
    context diffs).  A reference to The Book is given if possible,
    otherwise to V1.2+ sources.  Some of the bugs are timing-dependent
    and may only show up on a slow PC like mine.  It is possible that
    the bugs noted but not fixed are only compiler bugs.


SYSTEM BUGS
-----------

Direction flag
--------------
    In a recent posting, Bing Bang said that the direction flag was not
    being cleared properly in kernel/klib88.s, wth cp_mess() being the
    only (?) thing messed up.  In fact the main routines in klib88.s do
    not initialise it at all (after an interrupt)! So phys_copy(),
    vid_copy(), resvec(), and scr_up() for the new tty driver are also
    affected.  scr_down() of course needs the opposite direction and
    sets and restores it correctly.

    It is amazing that this has caused so few problems.  The kernel is
    vulnerable while any library routine has the direction flag set.
    I use an assembler version of memcpy() which sets it, but cannot
    attribute any crashes directly to this bug.

    The kernel ought to clear the direction flag on every interrupt
    (the processor ought to but doesn't) and keep it clear except in
    subroutines.  This requires some cooperation from the compiler and
    library routines.  Some compilers also require it.  For now, it's
    simpler and more portable to clear the flag in every routine
    needing it.

Fix:
    Add lines "cld" to klib88.s in the following places.
    Your line numbers probably differ.
            after  "cli"  in phys_copy()  (line 46)
                later would be clearer but don't put it in a loop
            before  "rep"  in cp_mess()   (line 168)
                best place
            after "cli" in vid_copy()     (line 482)
            before  "rep"  in scr_up()    (line 546??)
            before  "rep"  in res_vec()   (line 802??)

    Fixing this is supposed to let us avoid turning interrupts off for
    so long and get better response.  I tried deleting the "cli" in
    phys_copy() (the one in cp_mess() is usually redundant and the new
    tty driver does not use vid_copy()).  This seems to work but not
    noticeably change the response time, but needs further testing.


Floppy driver
-------------
    The variable fl_curcyl (part of struct floppy) is never initialised.
    (It is "cleared" in various places.)
Fix:
    Add the line
            fp->fl_curcyl = fp->fl_cylinder;
    just before the successful return from seek() in floppy.c,
    line 2785 (Book).

Effect:
    This might have wasted a lot of time doing unnecessary seeks, but
    it appears that the disk controller doesn't need to do any extra
    work, and in practice the floppy did not go any faster when the
    change was made.  I only found this bug when tracing another one. 
    The busy_map flags set by interrupt() turn out to be rarely set. 
    But the seek command was setting them a lot, probably because the
    controller realised a seek was unnecessary and responded instantly.


Lock/restore
------------
    It is difficult to work with the lock/unlock-restore functions to
    control interrupts.  Several better methods have already been
    mentioned in comp.os.minix.  I haven't implemented any of
    these, but to track down an interrupt-related bug I added some
    counters to the functions.  These detect locks being overwritten. 
    The following problems showed up.  All are caused by lock calls
    without a corresponding unlock.  They seem to be fairly harmless,
    but who knows.

(1)
    kernel/proc.c, unready(), lines 2164 and 2174 (Book).
    The function has done a lock, but returns without an unlock.
Fix:
    Replace
            ... return;
    by
            ... { restore(); return; }
    The 2nd return is executed fairly consistently when a shell is
    killed.
(2)
    kernel/printer.c, do_write(), line 111.
    A lock is done, but there is no corresponding unlock in the whole
    driver! I'm not sure how to fix this.  It would probably work to
    put an unlock when the function returns, but there are some
    busy-wait loops so this is not a good solution.
(3)
    kernel/main.c, main(), line 902 (Book) (first line of C executed).
    A lock is done without a corresponding unlock.  This is certainly
    harmless, but is redundant because the lock has already been done
    by a "cli" in fsck.s and again in head.s.  It is confusing to have
    it here, since a system wouldn't normally work if interrupts were
    on at the start (but the BIOS interrupt handlers are valid here, so
    it probably doesn't matter).


Reading and writing
-------------------
    Doing zero bytes of i/o to a bad file number should probably be an
    error.  I had a bug which caused this strange activity, but Minix
    did not complain so the error handler did not get called.


Signal system call
------------------
    Probably needs to return more complete status. See section on
    library signal().


Writing to /dev/null
--------------------
    The recent official patch to fs/read.c breaks writing to /dev/null. 
    It is for rw_chunk(), line 9963 (Book), and involves adding a
    test
            && !block_spec
    to the others on the line.  Since /dev/null is block special, the
    test now fails where it used to succeed.  The result is that
    get_block is called for /dev/null.  The driver returns EOF which
    gets translated into a funny error number (104) and the write fails.
Fix: (?)
    Make /dev/null character special instead of block special.
            su; cd /dev; rm null; mknod null c 1 3
    It certainly doesn't make any sense to cache /dev/null - it
    probably has been wasting a block.  The change makes /dev/null much
    faster; 30 sec reduced to 20 sec for copying 400K from hard disk to it.
Further problem:
    isatty() now returns TRUE for /dev/null, as it does for all
    character special devices.


COMMANDS BUGS
-------------

chmem
-----
    chmem.c, line 46
            if ( (header[0] & 0xFFFF) != MAGIC)
    This mask is non-portable. header[0] is long, and K & R say
    (long) 0xFFFF == -1 == 0xFFFFFFFF which is not what is wanted.
    Modern PC compilers mostly say otherwise.
Fix:
            if ( (header[0] & 0xFFFFL) != MAGIC)

    There are also some byte-order portability problems.  I ran Minix
    for a while with the wierd long byte order of 2-3-0-1 (I think it
    came from something justified on a PDP11, but stupid on an 8088). 
    Minix itself ran fine but there were a few problems with the
    utilities (mkfs is the main problem, fdisk a minor problem, and
    fsck no problem).

grep
----
    Doesn't stop at end of file.
Cause:
    Value returned by getc() on line 158 of grep.c is assigned to a
    char variable.  This doesn't matter much if characters are signed
    (but char 0xff will give a false eof), but my compiler uses
    unsigned characters (which are more efficient on the 8088).  The
    effect was that EOF was treated as 0x00ff and not 0xffff, a valid
    char, and grep received an infinitely long line of them.
Fix:
    Replace
            char *initbuf = buf, c;
    by
            char *initbuf;
            int c;
    Please declare variables on separate lines.

pwd
---
    pwd.c, line 8.
    The version I received is missing a semicolon after a structure
    declaration.  This leads to main() being declared as returning
    a struct direct.

pr
--
    The new version prints garbage in the last few positions when
    multiple columns are used.  Also, memory is probably not freed when
    it could be.
Cause:
    The carefully defined NIL_PTR is dereferenced on line 264 of pr.c,
    resulting in garbage null pointers.
Fix:
    Change
            ...  = *(char **) NIL_PTR;
    to
            ...  = NIL_PTR;
    To be consistent, the (logical) comparison with 0 on line 347
    should also be replaced with an explicit pointer comparsion.

sh
--
(1) 
Symptoms:
    Holding down the DEL key sometimes causes shell to logout, sometimes
    to hang. The exact behaviour varied as I changed other parts of
    the system (own clock handler mods, new tty driver, login mods
    posted by Peter Housel).
Cause:
    SIGINTs generated by the DEL key are not safely trapped in sh1.c. 
    The relevant lines are 224-226.
            if (talking)
                signal(SIGINT, onintr);
            if (setjmp(failpt = m1) || yyparse() || intr) {
    Here m1 is a  local  jump buffer and  failpt  is a  global  pointer
    to it.  failpt needs to be initialised before the signal(), because
    it is used by the signal catcher  onintr().  Any sigint caught in
    the huge :-) gap between the signal() and the setjmp() will activate
    this bug.
Fix:
    Just before these lines, add
            setjmp(failpt = m1);
    This setjmp() should always return directly, not through a
    longjump(), since onintr() is inactive here.  (It had better be,
    because failpt is certain to be garbage.  It would be sensible to
    deactivate failpt explicitly in the signal catcher and make it
    abort with an "Can't happen" type error message.)
Unfinished business:
    Holding down the DEL key STILL kills the shell in some situations. 
    When the system is overloaded, this is probably an inherent problem
    with *ix-style signal handling.  The shell catches a SIGINT and
    attempts to signal a catch for the next one, but the kernel
    services the SIGINT before setting up the catcher.  This problem
    can be consistently duplicated by doing a large cp in the
    background.  Unfortunately, it still happens (much less than
    without this fix) when the shell is on its own.  At least it no
    longer hangs.

    There's also something strange about the way this bug is so easy
    to duplicate, since the window of vulnerability is so small. It
    looks like the kernel is queueing the signal and dequeueing it
    when the catcher is installed, but I couldn't see anything like
    that in the code. Something in the login mods made the bug
    more apparent. The old login (except for the first) had all
    signals except those the shell sets ignored. The key SIGINT is
    not directly affected. Maybe just looking non-ignored signals
    clogs up the signal processing.

    The standard Minix longjmp() (which I don't use) backtraces the
    frame pointers and might have caught this bug. It just halts
    the processor (but not safely, the first interrupt will cancel
    the halt). Maybe the bug doesn't show up in standard Minix.

(2)
    Type at the prompt
            for i in 1 2 <RETURN> <DEL>
    Then the next attempt at a "for" command gives a sytax error. 
    Logging out cures this of course.  Doing a "set" command also
    works.

size
----
    size.c, line 49
            if ( (head[0] & 0xFFFF) != MAGIC)
    Same bug as in chmem.c.
Fix:
            if ( (head[0] & 0xFFFFL) != MAGIC)

su
--
    Does not pass on the environment.  See also execn().

time
----
    Uses its own version of $PATH (:/bin:/usr/bin).  Does not pass on
    the environment.

treecmp
-------
    Doesn't handle file names of length 14 properly.  Probably not null
    terminated.

wc
--
    wc.c, line 130
            while((c = getc(stdin)) > 0) {
    Surely 0 is a valid char? wc doesn't count executables too well
    with this. The Minix header stops them very early.
Fix:
            while((c = getc(stdin)) >= 0) {


LIBRARY
-------

execn()
-------
    Doesn't set up an adequate environment.  I forget the details, it
    may be that argv[0] is null (ps has trouble with the commmand name).
    Affects init and (much worse) su.

signal()
--------
    Part of the work for the Minix signal() system call is done by the
    library routine, because the kernel only keeps track of one signal
    catcher and not one for each signal.  The library routine tries to
    keep track of the individual catchers and the status of uncaught
    signals (SIG_DFL or SIGN_IGN) in an array.  This doesn't work for
    ignored signals after a fork-exec when the child wants to ignore
    some signals.  The array is in bss and gets 0-initialised to
    SIG_DFL.  The kernel just returns 0 all successful calls.
Affect:
    e.g.
            dd < bigfile > bigcopy &
    The dd in the background is aborted by all foreground SIGINTS.
Suggested fix:
    Have the kernel return
            SIGN_DFL    for previously defaulted signals
            SIG_IGN     for previously ignored signals
            catcher     for previously caught signals (easier to check than 0)
    The library routine would then have enough information to do its job.

system()
--------
    Doesn't check for failure of fork(). I think this is what causes
    commands with multiple pipes to often hang.
    Should probably use execlp (when available) instead of execl.
    Should probably pass on the environment.

Bruce Evans (brucee@runx.ips.oz)