zeller@ethz.UUCP (Lukas Zeller) (08/16/89)
I have been using and programming OS-9/68k for several years now. I have
written some drivers from scratch and have modified many existing drivers
for version updates and system ports. The question I'd like to ask the net
arose from this experience. In particular, I had to fix various drivers that
tended to "hang" *sometimes* and, in consequence, caused the system to
block. I hope there are some OSK gurus out there on the net (hello,
Microware!) who can give an answer.

The problem can occur in all I/O drivers that initiate some action in the
main process and then do an infinite sleep. The completion of the action
generates an interrupt, which causes the main process to be woken up. If for
some obscure reason this interrupt gets lost, the main process will
obviously never stop sleeping and the driver hangs.

Aside from truly "obscure" reasons that eat up interrupts before they can be
serviced, there is one possibility for this to happen that is inherent in
*all* original Microware drivers I know of (and in all other drivers derived
from Microware code, which covers most existing drivers).

The standard outline for an interrupt-controlled I/O driver is as follows,
according to existing source code as well as to P. Dibble's "OS-9 Insights",
paragraph 20.6:

   repeat
      mask interrupts
      if (IO request cannot be satisfied until hardware
          generates an interrupt) then
         UNMASK INTERRUPTS
         sleep
         continue
   until (IO request can be satisfied)

Now, what happens if the interrupt occurs *after* the decision that we need
to wait for an interrupt, but *before* the main process is asleep? The
interrupt routine is called immediately after the "UNMASK INTERRUPTS" step
and sends a wakeup signal to the main process. But the main process is not
sleeping yet and thus the wakeup signal is ignored (according to the
documentation, S$Wake ensures only that the process is running and will
*not* be queued).
Then the main process goes to sleep and will remain sleeping forever,
because the wakeup event already occurred before it went to sleep.

This problem is *not* a theoretical one at all. For example, when I had to
replace an old, slow SCSI controller with a new, fast one, the system
suddenly hung *sometimes*: while the old controller was simply slow enough
to ensure that the interrupt was *always* issued after the main process was
asleep, this was not true for the new controller. Sometimes it responded so
quickly that the interrupt got served before the F$Sleep call. I have had
similar problems with several other drivers from many different sources. As
said above, the prerequisites for this problem exist in virtually all
existing drivers, but it actually occurs with fast hardware only.

But how to avoid this problem? The condition is obvious: the interrupts
MUST NOT BE ENABLED BEFORE THE MAIN PROCESS IS ASLEEP. The only way to meet
this condition is to call F$Sleep WHILE THE INTERRUPTS ARE STILL DISABLED,
relying on F$Sleep itself to enable the interrupts when it is safe. I could
not find any hints in the documentation whether this is legal or not, but
the experiments done by several members of our local OS-9 interest group
show: IT WORKS. We modified most of our drivers of all types (SCF, RBF, SBF
and even NFM) and have had no problems yet (there is one caveat described at
the end of this message), and we have used this technique for more than a
year now.

So the problem is solved for practical purposes. But the solution is still
based on experiment only, and therefore we cannot be sure that it will work
in all systems, although it seems so. Also, we were very puzzled to realize
during the last year that the potential problem exists not only in some
European VME card manufacturers' OS-9 ports (which show - sad, but true in
our experience - very poor programming in general), but also in many other
drivers of excellent programming quality.
This applies even to the sample 68681 driver described in "OS-9 Insights".
The "wrong way" (in my opinion) seems to be the official one.

As a conclusion of all this, I'd like to ask the following questions:

- Any similar or contradictory experiences?

- Is the solution described above "legal" and reliable (especially in
  future versions of OSK)? If not, how can the problem be solved otherwise?

- How could this fault (if I am not completely wrong, it *is* a fault)
  propagate through most existing drivers without being discovered?

----------------------------------------------------------------------------
For you real OS-9 hackers interested in details:

As written above, there is one caveat for the solution above in OS-9 V2.2
(most probably also in V2.1, but I could not verify this). As long as the
system tick is running, everything works fine. But if the tick has not yet
been started, F$Sleep returns immediately without error, and the interrupt
remains masked during the execution of F$Sleep. Thus, drivers that call
F$Sleep with interrupts disabled will hang in this case unless special
handling is provided. I mention this because it caused me quite some
headache when I had to upgrade a system from V2.0 to V2.2 a few days ago,
and the hard disk driver suddenly did not work any more...
----------------------------------------------------------------------------

==========================  +---------------------------+  *****************
Lukas Zeller                |\         E-Mail:         /|  * MS-DOS...     *
ETH Zurich, Switzerland     | \_______________________/ |  *               *
(SFIT, Swiss Federal        | /   zeller@ethz.UUCP    \ |  * just say NO ! *
Institute of Technology)    | / ..cernvax!ethz!zeller \ |  *               *
==========================  +---------------------------+  *****************
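
The race described in the post above can be sketched as a toy Python model.
This is not OS-9 code: it only replays the two possible event orderings. It
assumes, as the post does from its reading of the documentation, that a
wakeup sent to a process that is not yet sleeping is simply dropped, and
that F$Sleep unmasks interrupts once the process is asleep. The names
Process, isr and run are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Process:
    """Toy stand-in for the driver's main process (not an OS-9 structure)."""
    sleeping: bool = False
    woken: bool = False


def isr(proc: Process) -> None:
    """Interrupt service routine: send a wakeup to the main process."""
    if proc.sleeping:
        proc.sleeping = False
        proc.woken = True
    # ASSUMPTION (the post's reading of the docs): a wakeup that reaches
    # a process which is not sleeping is dropped, not queued.


def run(interrupt_before_sleep: bool, sleep_with_irqs_masked: bool) -> bool:
    """Replay one wait cycle. Return True if the process is woken,
    False if it stays asleep forever (the driver hangs)."""
    proc = Process()
    irq_held_off = False   # interrupt raised by hardware but still masked
    masked = True          # "mask interrupts"; driver decides it must wait

    if not sleep_with_irqs_masked:
        masked = False     # "UNMASK INTERRUPTS" before sleeping

    if interrupt_before_sleep:
        # The device completes in the gap before the sleep call.
        if masked:
            irq_held_off = True
        else:
            isr(proc)      # serviced at once; process is not asleep yet

    proc.sleeping = True   # the sleep call takes effect
    masked = False         # ASSUMPTION: sleep unmasks once asleep

    if irq_held_off:
        isr(proc)          # held-off interrupt is serviced now
    if not interrupt_before_sleep:
        isr(proc)          # device completes after the process is asleep
    return proc.woken
```

Under these assumptions, only the combination of an early interrupt and an
unmask before the sleep call leaves the process asleep; keeping interrupts
masked into the sleep call delivers the held-off interrupt afterwards.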
ingoldsb@ctycal.COM (Terry Ingoldsby) (08/18/89)
In article <1772@ethz.UUCP>, zeller@ethz.UUCP (Lukas Zeller) writes:
> I am using and programming OS-9/68k for several years now. I have written
> some drivers from scratch and I have modified many existing drivers for
...
> The problem can occur in all I/O-drivers that initiate some action in the
> main process and then do an infinite sleep. The completion of the action
...
> The standard outline for an interrupt controlled I/O driver is as follows,
> according to existing source code as well as to P.Dibble's "OS-9 Insights",
> paragraph 20.6:
>
>    repeat
>       mask interrupts
>       if (IO request cannot be satisfied until hardware
>           generates an interrupt) then
>          UNMASK INTERRUPTS
>          sleep
>          continue
>    until (IO request can be satisfied)
>
> Now, what happens if the interrupt occurs *after* the decision that we need
> to wait for an interrupt, but *before* the main process is asleep ? The
> interrupt routine is called immediately after the "UNMASK INTERRUPTS" step
> and sends a wakeup signal to the main process. But the main process is not
> sleeping yet and thus the wakeup signal is ignored (according to the
> documentation S$Wake ensures only that the process is running and will *not*
> be queued). Then the main process goes to sleep and will remain sleeping for
> ever, because the wakeup event has occurred already before it went to sleep.

It's been a while since I did much driver-level programming in OS9, but I
seem to remember wondering how you were supposed to get around this problem.
Is the fix you propose
>
> MUST NOT BE ENABLED BEFORE THE MAIN PROCESS IS ASLEEP. The only way to match
> this condition is to call F$Sleep WHILE THE INTERRUPTS ARE STILL DISABLED,
> and relying on the F$Sleep itself enabling the interrupts when it is safe. I
also legal in OS9? If so, could Microware please tell us what other system
calls clear the interrupts (both in OSK and OS9)?
--
Terry Ingoldsby                ctycal!ingoldsb@calgary.UUCP
Land Information Systems                    or
The City of Calgary       ...{alberta,ubc-cs,utai}!calgary!ctycal!ingoldsb
dibble@cs.rochester.edu (Peter C. Dibble) (08/18/89)
This is in reply to a question about device drivers, in particular the gap
between the time interrupts are unmasked and the F$Sleep to wait for an
interrupt.

The question was, roughly: if an interrupt arrives and is serviced before
the driver sleeps and after it commits to sleep, isn't that interrupt
(actually the signal from the interrupt service routine) lost?

===

It turns out that the wakeup signal has a special property that makes it
queue... sort of.

A wakeup signal will either activate the target process or set the
signal-pending flag for that process. It activates the process (basically)
if the process is sleeping or waiting (either process wait or event wait).
The signal-pending flag causes the next sleep (or wait) to return
immediately. Signal pending is only cleared by sleep or wait.

Wakeup has had this property since OS-9/6809. I haven't seen it documented.
This is something I should add if I do another edition of Insights.

Keeping interrupts masked right into the F$Sleep will work, but it leaves
interrupts masked for a _long_ time. It'll cause serious performance
problems on systems that rely on fast interrupt response.

Peter
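
The wakeup behaviour described in the reply above can be sketched as a small
Python state model. It follows only the rules stated there (a wakeup either
activates a sleeping process or sets a signal-pending flag, and the flag is
cleared by the next sleep, which then returns immediately). The class and
method names are hypothetical, and this is not how the OS-9 kernel is
written.

```python
class Process:
    """Toy model of a process with a latched ("sort of queued") wakeup."""

    def __init__(self) -> None:
        self.sleeping = False
        self.signal_pending = False

    def wakeup(self) -> None:
        """Effect of the wakeup signal sent by the interrupt routine."""
        if self.sleeping:
            self.sleeping = False        # activate the sleeping process
        else:
            self.signal_pending = True   # remember it for the next sleep

    def sleep(self) -> bool:
        """Model of the sleep call.

        Returns True if it returned immediately because a wakeup was
        already pending, False if the process actually went to sleep.
        """
        if self.signal_pending:
            self.signal_pending = False  # cleared only by sleep (or wait)
            return True
        self.sleeping = True
        return False


def wakeup_arrives_first() -> bool:
    """Interrupt is serviced in the gap before the driver sleeps."""
    proc = Process()
    proc.wakeup()              # process still running: flag is set
    returned_at_once = proc.sleep()
    return returned_at_once and not proc.sleeping
```

In this model an early wakeup is not lost: wakeup_arrives_first() reports
that the following sleep returns at once and the process keeps running, so
the driver loop simply re-checks its condition.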
davidb@braa.inmos.co.uk (David Boreham) (08/21/89)
Yes! I too have always wondered about this. I am doing a driver for
hardware which almost invariably interrupts *before* the process sleeps.

I would have thought that the F$Sleep system call ought to unmask the
interrupt; otherwise, as the previous poster says, the interrupt will be
taken *before* the process is actually sleeping. Consequently, the signal
from the "bottom half" of the driver will be lost forever.

I can't believe that this problem is real, but I can't see what the answer
is. (Also, my driver screws up in a way consistent with the problem; this
proves nothing, however :).

David Boreham, INMOS Limited | mail(uk): davidb@inmos.co.uk or ukc!inmos!davidb
Bristol, England             |     (us): uunet!inmos-c!davidb
+44 454 616616 ex 543        | Internet : @col.hp.com:davidb@inmos-c