jwp@larry.UUCP (Jeffrey W Percival) (12/15/88)
I am running ULTRIX 2.something and wrote a C program to read records
off a 6250 BPI tape and write them on stdout. I use read(2) to read
the tape records, and pipe the output to some other program. Like this:

	readtape -f /dev/rmt1h | dowork > results

So, this thing is grinding along and I decide to trash the job, so
instead of interrupting the process with cntl/C, I pop the
drive off-line (thinking the jobs will abort). Later, I find
the "readtape" job hanging around with a priority of -5, and
I couldn't kill it. I tried putting a tape back on line to see if it
was waiting for something, I sent kill signals (IO, HUP, 9) to it to
no avail. A reboot finally cleared things up.

Now (besides the obvious, cntl/C to the program), what do I
do in the future when we have such a hung process?
--
Jeff Percival (jwp@larry.sal.wisc.edu)
kai@uicsrd.csrd.uiuc.edu (12/16/88)
> Written by jwp@larry.UUCP:
> ...
> so instead of interrupting the process with cntl/C, I pop the
> drive off-line (thinking the jobs will abort). Later, I find
> the "readtape" job hanging around with a priority of -5, and
> I couldn't kill it.

This problem is not specific to Ultrix. I've found the exact same thing
occurs on VAX BSD unix, Sequent Dynix, and Alliant Concentrix. The only
thing that seems to work is a reboot. I warned everyone here that
interrupting a tape job by putting the drive offline is a tremendous
mistake.

Also, folks using Ckermit (V057C) to dial out occasionally get hung up,
with essentially the same thing occurring. The process cannot be killed
(even during shutdown, you get the message "something wouldn't die - ps
axl advised"), and the device is permanently locked up. I've found that
usually, a "kill -HUP", "kill -INT", or "kill -QUIT" will actually get
rid of a suspect kermit process, but a "kill -TERM" almost always hangs
it (until the next reboot).

Is this a problem in the device driver, kernel process management, or
something else entirely? It seems that if an event is being waited for,
there ought to be a way to have the OS force the event to occur or fail.

Patrick Wolfe (pat@kai.com, kailand!pat)
System Manager, Kuck and Associates, Inc.
guy@auspex.UUCP (Guy Harris) (12/17/88)
>Now (besides the obvious, cntl/C to the program), what do I
>do in the future when we have such a hung process?

Reboot. I suspect the problem may be that an interrupt was lost, or
something like that, so the drive will appear to be *permanently* in a
non-usable state until you reboot, and the process blocked waiting for
the drive to finish whatever it was doing will block there forever;
since it's waiting at a priority less than PZERO, you can't get rid of
it with a signal. Several different kinds of devices exhibit this
problem in various flavors of UNIX.

In this particular case, there is arguably a bug in the tape driver; I
don't know how easy it is to fix.
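The PZERO rule mentioned above can be illustrated with a small user-space model. This is only a sketch, not kernel code: it assumes the classic convention that a sleep at a priority numerically greater than PZERO can be broken by a signal, while a sleep at or below PZERO ends only on a wakeup. The value given to PZERO and the names `fake_proc` and `sleep_outcome` are invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define PZERO 25  /* assumed value, for illustration only */

enum sleep_result { SLEPT_UNTIL_WAKEUP, INTERRUPTED_BY_SIGNAL, STILL_BLOCKED };

/* Hypothetical model of one sleeping process. */
struct fake_proc {
    int  pri;             /* priority the driver passed to sleep() */
    bool signal_pending;  /* e.g. someone ran kill -9 on it */
    bool wakeup_arrived;  /* the driver called wakeup() on the wait channel */
};

/* Decide what happens to the sleeper under the assumed rule. */
enum sleep_result sleep_outcome(const struct fake_proc *p)
{
    assert(p != 0);
    if (p->wakeup_arrived)
        return SLEPT_UNTIL_WAKEUP;
    /* Signals are only looked at for "interruptible" priorities. */
    if (p->pri > PZERO && p->signal_pending)
        return INTERRUPTED_BY_SIGNAL;
    return STILL_BLOCKED;
}
```

Under this model a process sleeping at PZERO - 5 with a signal pending and no wakeup stays blocked, which matches the unkillable "readtape" described in the original question.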
debra@alice.UUCP (Paul De Bra) (12/17/88)
In article <43200057@uicsrd.csrd.uiuc.edu> kai@uicsrd.csrd.uiuc.edu writes:
>...
>This problem is not specific to Ultrix. I've found the exact same thing
>occurs on VAX BSD unix, Sequent Dynix, and Alliant Concentrix. The only
>thing that seems to work is a reboot. I warned everyone here that
>interrupting a tape job by putting the drive offline is a tremendous
>mistake.
>...

I have observed exactly the same behavior with Ultrix 2.0 on a Microvax II,
where putting a TK-50 offline could only be recovered from by a reboot.
An identical machine with Unix 9Vr2 does not have this problem. So it clearly
is something in the Berkeley device driver.

Paul.
--
------------------------------------------------------
|debra@research.att.com   | uunet!research!debra     |
------------------------------------------------------
guy@auspex.UUCP (Guy Harris) (12/17/88)
>Is this a problem in the device driver, kernel process management, or
>something else entirely?

I consider it ultimately to be a problem in the device driver, unless
there is *no* software workaround for the problem. Both in the
tape-drive case where an interrupt may get lost, and the serial port
case where the "close" is waiting for output to drain, but output has
been suspended and the line is sufficiently dead that it's hard to
resume it, you can probably always have some timeout in the driver, but
you don't want to get screwed if the alarm goes off too soon.

>It seems that if an event is being waited for, there ought to be a way
>to have the OS force the event to occur or fail.

Unfortunately, in general there isn't such a way. I think some people
may have stuck a "forced wakeup" system call into their systems
(super-user only, I hope!); do a "ps" with the appropriate flags to get
the numeric wait channel ID and then run some program that uses that ID
in such a call.
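What a "forced wakeup" keyed on a wait channel amounts to can be sketched in user space. Everything here is hypothetical: `fake_proc`, its `wchan` field, and `forced_wakeup` are stand-ins for a process table and a call that, as far as the post above says, only some sites added locally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical process-table entry: wchan is NULL when runnable. */
struct fake_proc {
    int         pid;
    const void *wchan;   /* address the process is sleeping on */
};

/*
 * Make every process sleeping on `chan` runnable again and report how
 * many were woken.  A real kernel would also have to decide what the
 * woken code sees; this sketch only clears the wait channel.
 */
int forced_wakeup(struct fake_proc *tab, size_t n, const void *chan)
{
    assert(tab != NULL && chan != NULL);
    int woken = 0;
    for (size_t i = 0; i < n; i++) {
        if (tab[i].wchan == chan) {
            tab[i].wchan = NULL;
            woken++;
        }
    }
    return woken;
}
```

A driver that rechecks its condition after waking would presumably just go back to sleep, and one that does not recheck would carry on as if the transfer had completed, which may be why such a call is described above as super-user only.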
gwyn@smoke.BRL.MIL (Doug Gwyn ) (12/17/88)
In article <8555@alice.UUCP> debra@alice.UUCP () writes:
>An identical machine with Unix 9Vr2 does not have this problem. So it clearly
>is something in the Berkeley device driver.

That wouldn't surprise me very much. VAX (and PDP-11) magtape drivers
have been pretty horrible in most UNIXes I've seen, partly due to the
tendency of most controllers to generate one interrupt when a rewind
(for example) command is accepted, and another when BOT is reached.
This implies that some state must be maintained in the driver, and
you know how notoriously easy it is to get into trouble doing that.

It could be much worse -- when I was using a prerelease of VAX/VMS
(as a DEC subcontractor) ages ago, I took the TE16 off-line manually
while it was rewinding (in order to be sure it wouldn't be overwritten,
since it still had the write ring installed), and the WHOLE SYSTEM
wedged until the rewind completed. I nearly died laughing..
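The driver state implied by the two-interrupt rewind described above can be sketched as a small state machine. This is an illustration only, not taken from any real driver: the state names, the event names, and the `tape_softc` structure are all invented, and the handling of an off-line event is one plausible choice among several.

```c
#include <assert.h>

enum tape_state { TS_IDLE, TS_REW_ISSUED, TS_REWINDING, TS_ERROR };
enum tape_event { EV_CMD_ACCEPTED, EV_AT_BOT, EV_OFFLINE };

/* Hypothetical per-drive software state. */
struct tape_softc { enum tape_state state; };

void tape_start_rewind(struct tape_softc *sc)
{
    assert(sc != 0);
    sc->state = TS_REW_ISSUED;
}

/*
 * Interrupt-time state transition.  Returns 1 when a process sleeping
 * on the drive should be woken (operation finished or failed), else 0.
 */
int tape_interrupt(struct tape_softc *sc, enum tape_event ev)
{
    assert(sc != 0);
    switch (ev) {
    case EV_CMD_ACCEPTED:           /* first interrupt: rewind started */
        if (sc->state == TS_REW_ISSUED) {
            sc->state = TS_REWINDING;
            return 0;
        }
        break;
    case EV_AT_BOT:                 /* second interrupt: tape at BOT */
        if (sc->state == TS_REWINDING) {
            sc->state = TS_IDLE;
            return 1;
        }
        break;
    case EV_OFFLINE:                /* drive went away mid-operation */
        if (sc->state == TS_IDLE)
            return 0;
        break;
    }
    /* Anything unexpected: fail the operation instead of waiting forever. */
    sc->state = TS_ERROR;
    return 1;
}
```

The last two lines are the part that matters for the hang under discussion: an event that does not fit the current state ends the wait with an error rather than leaving the sleeper where it is.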
ka@june.cs.washington.edu (Kenneth Almquist) (12/17/88)
Patrick Wolfe asks:
> Is this a problem in the device driver, kernel process management, or
> something else entirely? It seems that if an event is being waited for,
> there ought to be a way to have the OS force the event to occur or fail.

The basic problem is that the device driver should be programmed to time
out after a while if it fails to receive an interrupt. But UN*X doesn't
make this particularly easy to do, and a lot of device drivers don't
bother. Perhaps a variant of the sleep routine with an additional
argument specifying a timeout would encourage writers of device drivers
to do things right.

Kenneth Almquist
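The sleep variant suggested above might behave roughly like the following user-space model. It is a sketch under stated assumptions: time is a simple tick counter, the device is a stand-in whose completion interrupt arrives at a fixed tick or never, and `sleep_with_timeout`, `fake_dev`, `WOKEN` and `TIMED_OUT` are names made up here rather than any existing kernel interface.

```c
#include <assert.h>
#include <stdbool.h>

enum { WOKEN = 0, TIMED_OUT = 1 };

/* Hypothetical device: interrupt arrives at this tick, or never if negative. */
struct fake_dev { int intr_at_tick; };

static bool intr_pending(const struct fake_dev *d, int tick)
{
    return d->intr_at_tick >= 0 && tick >= d->intr_at_tick;
}

/*
 * Wait for the device's completion interrupt, but give up after
 * timeout_ticks so the caller can return an error to the user
 * instead of blocking forever.
 */
int sleep_with_timeout(const struct fake_dev *d, int timeout_ticks)
{
    assert(d != 0 && timeout_ticks > 0);
    for (int tick = 0; tick < timeout_ticks; tick++) {
        if (intr_pending(d, tick))
            return WOKEN;
    }
    return TIMED_OUT;
}
```

With a device that interrupts at tick 3, a timeout of 10 returns WOKEN and a timeout of 2 returns TIMED_OUT even though the device is healthy. The second case is the "alarm goes off too soon" risk raised earlier in the thread, so the timeout has to be chosen with the slowest legitimate operation in mind.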
chris@mimsy.UUCP (Chris Torek) (12/18/88)
In article <8555@alice.UUCP> debra@alice.UUCP (Paul De Bra) writes:
>... Ultrix 2.0 on a Microvax II, where putting a TK-50 offline could
>only be recovered from by a reboot.
>An identical machine with Unix 9Vr2 does not have this problem. So
>it clearly is something in the Berkeley device driver.

Blame where credit is due department :-) : The Ultrix 2.0 TK50 driver
is not the same as the `Berkeley' TK50 driver. The `Berkeley' driver
was contributed by DEC and is probably a variant of the Ultrix 1.1 or
1.2 driver. One should be grateful to DEC for the existence of the
driver at all; but bugs in it are not the fault of someone at Berkeley.

In article <9221@smoke.BRL.MIL> gwyn@smoke.BRL.MIL (Doug Gwyn) writes:
>That wouldn't surprise me very much. VAX (and PDP-11) magtape drivers
>have been pretty horrible in most UNIXes I've seen, partly due to the
>tendency of most controllers to generate one interrupt when a rewind
>(for example) command is accepted, and another when BOT is reached.
>This implies that some state must be maintained in the driver, and
>you know how notoriously easy it is to get into trouble doing that.

Indeed. In this case, the driver is usually waiting for a DONE
interrupt of some kind, and gets an OFFLINE error interrupt instead.
It should then clean up and perhaps close the tape drive; user programs
should be able to recover by closing and reopening the drive, then
positioning the tape as appropriate.

>It could be much worse -- when I was using a prerelease of VAX/VMS
>(as a DEC subcontractor) ages ago, I took the TE16 off-line manually
>while it was rewinding (in order to be sure it wouldn't be overwritten,
>since it still had the write ring installed), and the WHOLE SYSTEM
>wedged until the rewind completed. I nearly died laughing..

This sort of thing is not always the fault of software.
Early revisions of the firmware for the Emulex SC41/MS (a UDA50
emulator that speaks SMD II or II+ or IIe or XMD or something---anyway,
it plugs into the big CDC 14 inch drives---where was I? Oh yes,
SC41/MS firmware) had a bug that would, under high load, hang a
VAX-11/750 so utterly that only power cycling, or pushing the little
white `reset' button, would bring it back. (All console microcode
function, including ^P, was suspended.) I told Emulex of the bug's
existence, but they did nothing about it until VMS V4 came out and
began exercising it. I think they doubted me :-) .

At any rate, the device driver situation is likely to suddenly become
better. Read all about it in a future Usenix conference proceedings
. . . I hope. (Not San Diego. Maybe Baltimore.)
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@mimsy.umd.edu	Path: uunet!mimsy!chris
henry@utzoo.uucp (Henry Spencer) (12/18/88)
In article <43200057@uicsrd.csrd.uiuc.edu> kai@uicsrd.csrd.uiuc.edu writes:
>This problem is not specific to Ultrix. I've found the exact same thing
>occurs on VAX BSD unix, Sequent Dynix, and Alliant Concentrix. The only
>thing that seems to work is a reboot. I warned everyone here that
>interrupting a tape job by putting the drive offline is a tremendous
>mistake.

Hardware permitting, it is sometimes possible to break a system out of
this sort of hang by taking the drive offline, spacing the tape forward,
initiating a rewind, and then putting the drive online before the rewind
finishes. On at least some drives, this generates the interrupt that
the driver is waiting for (although the driver may detect enough
anomalies in how this happened to complain). Of course, if your drive
won't let you do this without software cooperation, you're sorta stuck...

Many, many device drivers unfortunately don't observe a general rule of
robustness: unless there is legitimate reason for a device operation to
take an unbounded time to complete (e.g. a read from a terminal), drivers
should *never* sleep waiting for a device without setting a timeout.
This applies to *all* devices, since hardware failures should be handled
more gracefully than by just hanging, but devices that can wander off into
limbo due to human intervention are particularly important cases.
--
"God willing, we will return." | Henry Spencer at U of Toronto Zoology
-Eugene Cernan, the Moon, 1972 | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
jbn@glacier.STANFORD.EDU (John B. Nagle) (12/19/88)
Lost interrupts tend to be more of a problem on systems with interrupts as edges rather than levels. PDP11s and VAXen are in the former category, and Motorola M680x0 machines are in the latter. On Motorola iron, and on the buses usually used with it, controllers raise an interrupt line when they want attention, and the interrupt will recur until the controller is made happy. Thus, interrupts tend not to be lost; even if the CPU, bus, or driver loses one, it will recur until serviced. On the other hand, obscure errors in driver interrupt processing in level-triggered interrupt systems tend to result in the system hanging in a tight loop incorrectly servicing the interrupt. In systems with edge-triggered interrupts, one needs a timer to detect lost interrupts. In systems with level-triggered interrupts, one may need a counter to detect ones that won't clear. John Nagle
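The counter suggested above for level-triggered systems can be sketched as a bound on how many times a handler will service one device before giving up. This is a user-space model with invented names (`fake_ctlr`, `handle_level_interrupt`) and an arbitrary limit; what a real driver should do once the limit is hit is left open.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SERVICE_PASSES 8   /* arbitrary limit chosen for this sketch */

/* Hypothetical controller whose interrupt line stays up until satisfied. */
struct fake_ctlr {
    int  passes_until_clear;   /* service passes needed; negative = never clears */
    bool line_asserted;
    bool disabled;             /* set when the handler gives up on the device */
};

static void service_once(struct fake_ctlr *c)
{
    if (c->passes_until_clear > 0 && --c->passes_until_clear == 0)
        c->line_asserted = false;
}

/*
 * Service the device while its line is asserted.  Returns the number
 * of passes taken, or -1 if the line would not clear and the device
 * was marked disabled instead of looping forever.
 */
int handle_level_interrupt(struct fake_ctlr *c)
{
    assert(c != 0);
    int passes = 0;
    while (c->line_asserted) {
        if (passes == MAX_SERVICE_PASSES) {
            c->disabled = true;
            return -1;
        }
        service_once(c);
        passes++;
    }
    return passes;
}
```

The timer half of the same advice, for edge-triggered systems, is the usual timeout on the wait for completion.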
jfh@rpp386.Dallas.TX.US (The Beach Bum) (12/19/88)
In article <1988Dec18.023931.28730@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>Many, many device drivers unfortunately don't observe a general rule of
>robustness: unless there is legitimate reason for a device operation to
>take an unbounded time to complete (e.g. a read from a terminal), drivers
>should *never* sleep waiting for a device without setting a timeout.

Or as a minimum, provide a means of awakening a wedged driver. Perhaps
by using a single event and checking for error conditions or completion
prior to sleeping or continuing.

>This applies to *all* devices, since hardware failures should be handled
>more gracefully than by just hanging, but devices that can wander off into
>limbo due to human intervention are particularly important cases.

And as a suggestion - the inclusion of an ioctl to RESET the device has
proven most helpful with a certain vendor's hardware which is highly
prone to extreme flakiness.
--
John F. Haugh II                        +-Quote of the Week:-------------------
VoiceNet: (214) 250-3311   Data: -6272  |"Just remember, if you swap the first
InterNet: jfh@rpp386.Dallas.TX.US       | and second letters of USENET you get
UucpNet : <backbone>!killer!rpp386!jfh  +-SUENET." -- J. F. Haugh II------
henry@utzoo.uucp (Henry Spencer) (12/20/88)
In article <17909@glacier.STANFORD.EDU> jbn@glacier.UUCP (John B. Nagle) writes:
> Lost interrupts tend to be more of a problem on systems with interrupts
>as edges rather than levels. PDP11s and VAXen are in the former category, and
>Motorola M680x0 machines are in the latter. On Motorola iron, and on the
>buses usually used with it, controllers raise an interrupt line when they
>want attention, and the interrupt will recur until the controller is made
>happy...

Can you explain where you got the idea that the PDP11 (I can't speak for
the VAX) does something different? Believe me, having an interrupt recur
until the controller is happy is a well-known nuisance on the 11, and as
far as I know (I'm not intimate with the Unibus any more), the protocol
is entirely level-triggered.

The problem being discussed here is not an interrupt being missed, but
no interrupt coming in at all -- the software is waiting for an event
that may never occur if the tape drive is meddled with by humans.
--
"God willing, we will return." | Henry Spencer at U of Toronto Zoology
-Eugene Cernan, the Moon, 1972 | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
dave@micropen (David F. Carlson) (12/20/88)
In article <732@auspex.UUCP>, guy@auspex.UUCP (Guy Harris) writes:
> >Now (besides the obvious, cntl/C to the program), what do I
> >do in the future when we have such a hung process?
>
> Reboot. I suspect the problem may be that an interrupt was lost, or
> since it's waiting at a priority less than PZERO, you can't get rid of
>
> In this particular case, there is arguably a bug in the tape driver; I
> don't know how easy it is to fix.

Of course, there is a bug. And by the description, the driver is waiting
at less than PZERO too. My suggestion, and yes, in my code requirements,
is that *ALL* device drivers have a hard reset ioctl that frees every
used resource and resets the state of the hardware to a known good
state. There is no excuse for the default action of a driver bug (yes,
there will be bugs) hanging a process without any hope of salvation.
The code is usually already written in for most drivers that have an
init() code.

Is there any reason to build a device driver without a hard reset?
--
David F. Carlson, Micropen, Inc.
micropen!dave@ee.rochester.edu
"The faster I go, the behinder I get." --Lewis Carroll
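A hard reset ioctl of the kind argued for above might be shaped roughly as follows. This is a user-space sketch, not a real driver entry point: `FAKE_IOC_RESET`, `fake_drv`, `fake_init` and `fake_ioctl` are hypothetical names, and the bookkeeping fields stand in for whatever resources an actual driver holds.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define FAKE_IOC_RESET 1   /* hypothetical command number */

/* Hypothetical driver state. */
struct fake_drv {
    bool busy;           /* an operation is outstanding */
    bool buf_allocated;  /* driver holds a transfer buffer */
    int  sleepers;       /* processes blocked inside the driver */
    int  last_error;     /* error handed to anyone woken by the reset */
};

/* The same code an init() routine would run at boot. */
static void fake_init(struct fake_drv *d)
{
    d->busy = false;
    d->buf_allocated = false;
}

/* Returns 0 on success or an errno value, in the style of driver entry points. */
int fake_ioctl(struct fake_drv *d, int cmd)
{
    assert(d != 0);
    switch (cmd) {
    case FAKE_IOC_RESET:
        fake_init(d);          /* back to a known good state */
        d->last_error = EIO;   /* the interrupted operation failed */
        d->sleepers = 0;       /* everyone waiting is released */
        return 0;
    default:
        return ENOTTY;         /* command not understood by this driver */
    }
}
```

The sketch reuses the init path for the reset, following the remark above that most of the code already exists in drivers that have an init() routine.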
ggs@ulysses.homer.nj.att.com (Griff Smith) (12/21/88)
In article <43200057@uicsrd.csrd.uiuc.edu>, kai@uicsrd.csrd.uiuc.edu writes:
>
> > Written by jwp@larry.UUCP:
> > ...
> > so instead of interrupting the process with cntl/C, I pop the
> > drive off-line (thinking the jobs will abort). Later, I find
> > the "readtape" job hanging around with a priority of -5, and
> > I couldn't kill it.
>
> This problem is not specific to Ultrix. I've found the exact same thing
> occurs on VAX BSD unix, Sequent Dynix, and Alliant Concentrix. The only
> thing that seems to work is a reboot.

It's not just the brand of operating system; specific drives may work
better than others. The 4.3BSD device driver for the TU78 tape drive
should have no problem with a drive going off-line; the error recovery
is implemented to follow the procedures described in DEC's TM78
documentation. I tested the driver with all the nasty cases I could
think of, including dropping power to the tape controller while a tape
was in motion. The driver survived on a VAX 11/785, but a power problem
on a VAX 8650 caused an interrupt loop that required a re-boot. You
might try tripping power breakers to see what happens, but not when you
aren't willing to take a crash.

> Is this a problem in the device driver, kernel process management, or
> something else entirely?
> there ought to be a way to have the OS force the event to occur or fail.

The problem is that people who write tape device drivers often don't
give a damn about error recovery. There has also been little interest
in defining consistent error recovery semantics, even WITHIN offerings
from a single vendor. Some tape drives also make it difficult to deal
gracefully with errors. I think 4.4BSD will have a close approximation
of reasonable behavior; I'm surprised that Ultrix doesn't yet.

> Patrick Wolfe (pat@kai.com, kailand!pat)
> System Manager, Kuck and Associates, Inc.
--
Griff Smith	AT&T (Bell Laboratories), Murray Hill
Phone:		1-201-582-7736
UUCP:		{most AT&T sites}!ulysses!ggs
Internet:	ggs@ulysses.att.com
chris@mimsy.UUCP (Chris Torek) (12/21/88)
In article <11027@ulysses.homer.nj.att.com> ggs@ulysses.homer.nj.att.com (Griff Smith) writes:
>... a power problem on a VAX 8650 caused an interrupt loop that required a
>re-boot.

We saw something similar on an 8600. I think the SBIA gets confused.

A more general reset scheme may surface someday (if I ever finish this
paper...); it may be possible to recover even from this.
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@mimsy.umd.edu	Path: uunet!mimsy!chris
w-colinp@microsoft.UUCP (Colin Plumb) (12/21/88)
In an OS I was doing recently, as we were doing the device interface
(calls a device should support... open, close, read, write, ioctl for
misc. stuff), I successfully argued that abort should be at this level,
to encourage people to use it, and so there is no perverse thing you
could do with ioctl that might cause it to block. It's a special case
all the way through (since it does slightly nasty things to the
transaction in progress).
--
-Colin (uunet!microsof!w-colinp)
dave@onfcanim.UUCP (Dave Martindale) (12/21/88)
In article <1988Dec19.215505.3768@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>>On Motorola iron, and on the
>>buses usually used with it, controllers raise an interrupt line when they
>>want attention, and the interrupt will recur until the controller is made
>>happy...
>
>Can you explain where you got the idea that the PDP11 (I can't speak for
>the VAX) does something different? Believe me, having an interrupt recur
>until the controller is happy is a well-known nuisance on the 11, and as
>far as I know (I'm not intimate with the Unibus any more), the protocol
>is entirely level-triggered.

The behaviour depends on the interrupt controller in the device.
The Unibus itself just handles requests and grants, and a device is
free to request over and over again until it's happy.

However, the original DEC interrupt controller card contained a
flip-flop that was triggered by the rising edge of an interrupt request
signal (usually DONE anded with interrupt enable) and cleared when the
interrupt had been granted by the Unibus. Later DEC devices, with all
the circuitry on a single card, retained this behaviour.

Thus, it is common to have a Unibus device driver just handle the
information passed back from the device by an interrupt without ever
doing anything to change the state of the device. The DONE or READY
bit and IENABLE bits remain set, and the software knows that the
hardware will not request another interrupt.
guy@auspex.UUCP (Guy Harris) (12/23/88)
>Of course, there is a bug.
Note, though, that the bug may be the lack of a timeout for some
operations, rather than something the driver explicitly does to lose an
interrupt; there may well be hardware botches that permanently lose
completion interrupts.
rbj@nav.icst.nbs.gov (Root Boy Jim) (01/07/89)
? In article <9221@smoke.BRL.MIL> gwyn@smoke.BRL.MIL (Doug Gwyn) writes:
? >This implies that some state must be maintained in the driver, and
? >you know how notoriously easy it is to get into trouble doing that.
I heard that Sun is coming out with NTS a stateless Network Tape System.
Most likely AT&T will come out with RTS which preserves UNIX semantics.
(Root Boy) Jim Cottrell (301) 975-5688
<rbj@nav.icst.nbs.gov> or <rbj@icst-cmr.arpa>
Crackers and Worms -- Breakfast of Champions!