[comp.os.os9] Interrupt handling error in most OSK drivers ???

zeller@ethz.UUCP (Lukas Zeller) (08/16/89)

I have been using and programming OS-9/68k for several years now. I have
written some drivers from scratch and have modified many existing drivers
for version updates and system ports. The question I'd like to ask the net
arose from this experience. In particular, I had to fix various drivers
that tended to "hang" *sometimes* and, as a consequence, caused the system
to block.
I hope there are some OSK gurus out there on the net  (hello,  Microware  !)
who can give an answer.

The problem can occur in all I/O drivers that initiate some action in the
main process and then do an infinite sleep. The completion of the action
generates an interrupt, which causes the main process to be woken up. If for
some obscure reason this interrupt gets lost, the main process will
obviously never stop sleeping and the driver hangs. Aside from truly
"obscure" reasons that eat up interrupts before they can be serviced, there
is one way for this to happen that is inherent in *all* original Microware
drivers I know of (and in all other drivers derived from Microware code,
which covers most existing drivers):

The standard outline for an interrupt controlled I/O driver is  as  follows,
according to existing source code as well as to P.Dibble's "OS-9  Insights",
paragraph 20.6:

   repeat
        mask interrupts
        if (IO request cannot be satisfied until hardware
            generates an interrupt) then
            UNMASK INTERRUPTS
            sleep
            continue
   until (IO request can be satisfied)

Now, what happens if the interrupt occurs *after* the decision that we need
to wait for an interrupt, but *before* the main process is asleep ? The
interrupt routine is called immediately after the "UNMASK INTERRUPTS" step
and sends a wakeup signal to the main process. But the main process is not
sleeping yet, and thus the wakeup signal is ignored (according to the
documentation, S$Wake only ensures that the process is running; the signal
is *not* queued). Then the main process goes to sleep and will remain asleep
forever, because the wakeup event already occurred before it went to sleep.
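
The window can be made concrete with a small single-threaded C simulation.
This is only a sketch under the documented reading above (a wakeup sent to a
process that is not asleep is simply dropped). The Sim structure and all
function names are hypothetical; none of them are OS-9 calls.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one driver process and one device. */
typedef struct {
    bool irq_masked;  /* driver has interrupts masked                 */
    bool irq_raised;  /* device has raised its completion interrupt   */
    bool io_done;     /* set by the interrupt service routine         */
    bool asleep;      /* driver process is in its indefinite sleep    */
} Sim;

/* Interrupt service routine: note completion and send the wakeup.
   Under the documented reading, the wakeup only has an effect if the
   process is asleep; otherwise it is dropped. */
static void isr(Sim *s) {
    s->irq_raised = false;
    s->io_done = true;
    if (s->asleep)
        s->asleep = false;
}

/* Run the ISR if an interrupt is pending and not masked. */
static void deliver(Sim *s) {
    if (s->irq_raised && !s->irq_masked)
        isr(s);
}

/* One pass through the standard outline. device_fast means the device
   completes before the driver reaches its sleep call.
   Returns true if the driver is left asleep with nothing to wake it. */
bool standard_outline_hangs(bool device_fast) {
    Sim s = { .irq_masked = true };      /* mask interrupts            */
    assert(!s.io_done);                  /* decision: we must wait     */
    if (device_fast)
        s.irq_raised = true;             /* device is already finished */
    s.irq_masked = false;                /* UNMASK INTERRUPTS          */
    deliver(&s);                         /* fast case: ISR runs here   */
    s.asleep = true;                     /* sleep                      */
    if (!device_fast)
        s.irq_raised = true;             /* slow device finishes now   */
    deliver(&s);
    return s.asleep;
}
```

In this model standard_outline_hangs(true) reports a hang (the fast
controller), while standard_outline_hangs(false) does not (the slow one).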

This problem is *not* a theoretical one at all. For example, when I had to
replace an old, slow SCSI controller with a new, fast one, the system
suddenly hung *sometimes*. While the old controller was simply slow enough
to ensure that the interrupt was *always* issued after the main process was
asleep, this was not true of the new controller. Sometimes it responded so
quickly that the interrupt was serviced before the F$Sleep call. I have run
into similar problems with several other drivers from many different
sources. As said above, the prerequisites for this problem are present in
virtually all existing drivers, but it actually occurs only with fast
hardware.

But how to avoid this problem ? The condition is obvious: The interrupts
MUST NOT BE ENABLED BEFORE THE MAIN PROCESS IS ASLEEP. The only way to meet
this condition is to call F$Sleep WHILE THE INTERRUPTS ARE STILL DISABLED,
and to rely on F$Sleep itself enabling the interrupts when it is safe. I
could not find any hint in the documentation as to whether this is legal or
not, but the experiments done by several members of our local OS-9 interest
group show: IT WORKS. We modified most of our drivers of all types (SCF,
RBF, SBF and even NFM) and have had no problems so far (there is one caveat,
described at the end of this message), and we have used this technique for
more than a year now.
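
A minimal sketch of why this ordering closes the window, again as a
single-threaded C simulation. It assumes what the experiments above suggest
but the documentation does not state: that F$Sleep unmasks interrupts only
once the process is already marked as sleeping. All names are hypothetical;
none of them are OS-9 calls.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one driver process and one device. */
typedef struct {
    bool irq_masked;  /* driver has interrupts masked                 */
    bool irq_raised;  /* device has raised its completion interrupt   */
    bool io_done;     /* set by the interrupt service routine         */
    bool asleep;      /* driver process is in its indefinite sleep    */
} Sim;

/* Run the ISR if an interrupt is pending and not masked. The ISR notes
   completion and wakes the process. */
static void deliver(Sim *s) {
    if (s->irq_raised && !s->irq_masked) {
        s->irq_raised = false;
        s->io_done = true;
        s->asleep = false;
    }
}

/* ASSUMED behavior of the sleep call when entered with interrupts
   masked: the process is marked asleep first, the mask is opened after. */
static void sleep_and_unmask(Sim *s) {
    s->asleep = true;
    s->irq_masked = false;
    deliver(s);   /* a held-back interrupt now finds a sleeping process */
}

/* One pass through the modified outline.
   Returns true if the driver is left asleep with nothing to wake it. */
bool masked_sleep_hangs(bool device_fast) {
    Sim s = { .irq_masked = true };      /* mask interrupts            */
    assert(!s.io_done);                  /* decision: we must wait     */
    if (device_fast)
        s.irq_raised = true;             /* device is already finished */
    deliver(&s);                         /* masked: stays pending      */
    sleep_and_unmask(&s);                /* sleep with mask still set  */
    if (!device_fast) {
        s.irq_raised = true;             /* slow device finishes now   */
        deliver(&s);
    }
    return s.asleep;
}
```

Under that assumption neither the fast nor the slow device leaves the
process asleep: the early interrupt is held back by the mask until the
sleep call itself lets it through.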

So the problem is solved for practical purposes. But the solution is still
based on experiment only, and therefore we cannot be sure that it will work
on all systems, although it seems so.
Also, we were very puzzled to realize during the last year that the
potential problem exists not only in some European VME card manufacturer's
OS-9 ports (which show, sad but true in our experience, very poor
programming in general), but also in many other drivers of excellent
programming quality. This applies even to the sample 68681 driver described
in "OS-9 Insights". The "wrong way" (in my opinion) seems to be the official
one.

As a conclusion of all this, I'd like to ask the following questions:

   -  Any similar or contradictory experiences ?
   -  Is the solution described above "legal" and reliable ? (especially  in
      future versions of OSK).  If  not,  how  can  the  problem  be  solved
      otherwise ?
   -  How could this fault (if I am not completely wrong, it *is*  a  fault)
      propagate through most existing drivers without getting discovered ?

----------------------------------------------------------------------------
For you real OS-9 hackers interested in details: As written above, there  is
one caveat for the solution above in OS-9 V2.2 (most probably also in  V2.1,
but I could not verify this).  As  long  as  the  system  tick  is  running,
everything works fine. But if the tick has not  yet  been  started,  F$Sleep
returns immediately without error, and the interrupt remains  masked  during
the execution of F$Sleep. Thus, drivers that call  F$Sleep  with  interrupts
disabled will hang in this case unless special handling is provided. I
mention this because it caused me quite a headache when I had to
upgrade a system from V2.0 to V2.2 a few days ago, and the hard disk driver
suddenly did not work any more...
----------------------------------------------------------------------------
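
One conceivable form of the "special handling" mentioned in the note above,
sketched as a C simulation. The note does not say what handling was actually
used, so the fallback shown here (opening the mask by hand when the sleep
comes back without the I/O being complete) is purely an assumption, as are
all names. The sleep model only covers the case where the device has
already raised its interrupt.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a driver waiting for one completion interrupt. */
typedef struct {
    bool tick_running;  /* system tick has been started               */
    bool irq_masked;    /* driver has interrupts masked               */
    bool irq_raised;    /* device has raised its interrupt            */
    bool io_done;       /* set by the interrupt service routine       */
} Sim;

/* Run the ISR if an interrupt is pending and not masked. */
static void deliver(Sim *s) {
    if (s->irq_raised && !s->irq_masked) {
        s->irq_raised = false;
        s->io_done = true;
    }
}

/* Sleep as described for V2.2: without a tick it returns at once and
   leaves the mask alone. With a tick, sleeping and being woken are
   collapsed into "unmask and take the pending interrupt". */
static void model_sleep(Sim *s) {
    if (!s->tick_running)
        return;
    s->irq_masked = false;
    deliver(s);
}

/* Driver wait loop that sleeps with interrupts masked.
   Returns the pass on which the I/O was seen complete, or -1 if it
   never was within max_passes. */
int wait_for_io(Sim *s, bool use_fallback, int max_passes) {
    assert(max_passes > 0);
    for (int pass = 1; pass <= max_passes; pass++) {
        s->irq_masked = true;
        if (s->io_done) {
            s->irq_masked = false;
            return pass;
        }
        model_sleep(s);
        if (use_fallback && !s->io_done) {
            /* ASSUMED special handling: the sleep returned but nothing
               completed, so open the mask ourselves for a moment. */
            s->irq_masked = false;
            deliver(s);
        }
    }
    return -1;
}
```

In this model the loop without the fallback never completes while the tick
is stopped; with the fallback it degrades into polling until the tick is
started.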

==========================  +---------------------------+  *****************
      Lukas Zeller          |\         E-Mail:         /|  *    MS-DOS...  *
 ETH Zurich, Switzerland    | \_______________________/ |  *               *
  (SFIT, Swiss Federal      |  /  zeller@ethz.UUCP   \  |  * just say NO ! *
 Institute of Technology)   | / ..cernvax!ethz!zeller \ |  *               *
==========================  +---------------------------+  *****************

ingoldsb@ctycal.COM (Terry Ingoldsby) (08/18/89)

In article <1772@ethz.UUCP>, zeller@ethz.UUCP (Lukas Zeller) writes:
 > I have been using and programming OS-9/68k for several years now. I have
 > written some drivers from scratch and have modified many existing drivers
...
 > The problem can occur in all I/O drivers that initiate some action in the
 > main process and then do an infinite sleep. The completion of the action
... 
 > The standard outline for an interrupt controlled I/O driver is  as  follows,
 > according to existing source code as well as to P.Dibble's "OS-9  Insights",
 > paragraph 20.6:
 > 
 >    repeat
 >         mask interrupts
 >         if (IO request cannot be satisfied until hardware
 >             generates an interrupt) then
 >             UNMASK INTERRUPTS
 >             sleep
 >             continue
 >    until (IO request can be satisfied)
 > 
 > Now, what happens if the interrupt occurs *after* the decision that we need
 > to wait for an interrupt, but *before* the main process is asleep ? The
 > interrupt routine is called immediately after the "UNMASK INTERRUPTS" step
 > and sends a wakeup signal to the main process. But the main process is not
 > sleeping yet, and thus the wakeup signal is ignored (according to the
 > documentation, S$Wake only ensures that the process is running; the signal
 > is *not* queued). Then the main process goes to sleep and will remain asleep
 > forever, because the wakeup event already occurred before it went to sleep.

It's been a while since I did much driver level programming in OS9, but I seem
to remember wondering how you were supposed to get around this problem.  Is the
fix you propose
 > 
 > MUST NOT BE ENABLED BEFORE THE MAIN PROCESS IS ASLEEP. The only way to meet
 > this condition is to call F$Sleep WHILE THE INTERRUPTS ARE STILL DISABLED,
 > and to rely on F$Sleep itself enabling the interrupts when it is safe. I

also legal in OS9?  If so, could Microware please tell us what other system
calls clear the interrupts?  (both in OSK and OS9).


-- 
  Terry Ingoldsby                       ctycal!ingoldsb@calgary.UUCP
  Land Information Systems                           or
  The City of Calgary         ...{alberta,ubc-cs,utai}!calgary!ctycal!ingoldsb

dibble@cs.rochester.edu (Peter C. Dibble) (08/18/89)

This is in reply to a question about device drivers, in particular the
gap between the time interrupts are unmasked and the F$Sleep to wait
for an interrupt.

The question was, roughly: if an interrupt arrives and is serviced
after the driver commits to sleep but before it actually sleeps, isn't that
interrupt (actually the signal from the interrupt service routine) lost?

===

It turns out that the wakeup signal has a special property that makes it
queue... sort of.  A wakeup signal will either activate the target process
or set the signal-pending flag for that process.  It activates the process
(basically) if the process is sleeping or waiting (either process wait or
event wait).  The signal-pending flag causes the next sleep (or wait) to
return immediately.  Signal pending is only cleared by sleep or wait.
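
The rule just described can be sketched in a few lines of C. This is a model
of the rule as stated here, not Microware's code; Proc, send_wakeup and
do_sleep are made-up names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-process state relevant to the wakeup signal. */
typedef struct {
    bool asleep;          /* process is sleeping or waiting            */
    bool signal_pending;  /* a wakeup arrived while it was running     */
} Proc;

/* Wakeup: activate a sleeping process, otherwise remember the signal. */
void send_wakeup(Proc *p) {
    assert(p != NULL);
    if (p->asleep)
        p->asleep = false;
    else
        p->signal_pending = true;
}

/* Sleep: return immediately if a wakeup is pending (clearing the flag),
   otherwise put the process to sleep.
   Returns true if the process actually went to sleep. */
bool do_sleep(Proc *p) {
    assert(p != NULL);
    if (p->signal_pending) {
        p->signal_pending = false;
        return false;
    }
    p->asleep = true;
    return true;
}
```

With this rule, a wakeup sent in the gap between unmasking interrupts and
the F$Sleep call sets the flag, and the F$Sleep that follows returns at once
instead of sleeping forever.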

Wakeup has had this property since OS-9/6809.  I haven't seen it documented.
This is something I should add if I do another edition of Insights.

Keeping interrupts masked right into the F$Sleep will work, but it leaves
interrupts masked for a _long_ time.  It'll cause serious performance
problems on systems that rely on fast interrupt response.

Peter

davidb@braa.inmos.co.uk (David Boreham) (08/21/89)

Yes ! I too have always wondered about this. I am writing a driver
for hardware which almost invariably interrupts *before* the process
sleeps. I would have thought that the F$Sleep system call ought to
unmask the interrupt; otherwise, as the previous poster says, the
interrupt will be taken *before* the process is actually sleeping.
Consequently, the signal from the "bottom-half" of the driver will
be lost forever. I can't believe that this problem is real, but I
can't see what the answer is. (Also, my driver screws up in a way
consistent with the problem, though this proves nothing :) .)

David Boreham, INMOS Limited | mail(uk): davidb@inmos.co.uk or ukc!inmos!davidb
Bristol,  England            |      (us): uunet!inmos-c!davidb
+44 454 616616 ex 543        | Internet : @col.hp.com:davidb@inmos-c