[comp.unix.wizards] Ultrix tape job is unkillable!

jwp@larry.UUCP (Jeffrey W Percival) (12/15/88)

I am running ULTRIX 2.something and wrote a C program to
read records off a 6250 BPI tape and write them on stdout.
I use read(2) to read the tape records, and pipe the output
to some other program.  Like this:

	readtape -f /dev/rmt1h | dowork > results

So, this thing is grinding along and I decide to trash the job,
so instead of interrupting the process with cntl/C, I pop the
drive off-line (thinking the jobs will abort).  Later, I find
the "readtape" job hanging around with a priority of -5, and
I couldn't kill it.  I tried putting a tape back on line to
see if it was waiting for something, and I sent kill signals (IO, HUP, 9)
to it, to no avail.  A reboot finally cleared things up.

Now (besides the obvious, cntl/C to the program), what do I
do in the future when we have such a hung process?
-- 
Jeff Percival (jwp@larry.sal.wisc.edu)
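
For reference, a minimal sketch in C of the kind of record-copying loop
described above.  This is not the poster's program: copy_records,
write_all and MAX_RECORD are names invented for illustration, and the
buffer size is an assumption.  On a raw tape device each read(2)
normally returns one record, and a return of 0 marks a tape mark or
end of data.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

#define MAX_RECORD 65536   /* assumed upper bound on one tape record */

/* Write the whole buffer, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Copy records from in_fd to out_fd until read() returns 0.
 * Returns the number of records copied, or -1 on error. */
long copy_records(int in_fd, int out_fd)
{
    static char buf[MAX_RECORD];
    long records = 0;

    assert(in_fd >= 0 && out_fd >= 0);
    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n == 0)
            return records;        /* tape mark or end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (write_all(out_fd, buf, (size_t)n) < 0)
            return -1;
        records++;
    }
}
```

Nothing in a loop like this can protect against the hang discussed in
this thread: once the process is blocked inside read() in the driver,
no user-level code runs until the driver wakes it up.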

kai@uicsrd.csrd.uiuc.edu (12/16/88)

> Written by jwp@larry.UUCP:
> ...
> so instead of interrupting the process with cntl/C, I pop the
> drive off-line (thinking the jobs will abort).  Later, I find
> the "readtape" job hanging around with a priority of -5, and
> I couldn't kill it.

This problem is not specific to Ultrix.  I've found the exact same thing
occurs on VAX BSD unix, Sequent Dynix, and Alliant Concentrix.  The only
thing that seems to work is a reboot.  I warned everyone here that
interrupting a tape job by putting the drive offline is a tremendous
mistake.

Also, folks using Ckermit (V057C) to dial out occasionally get hung up, with
essentially the same thing occurring.  The process cannot be killed (even
during shutdown, you get the message "something wouldn't die - ps axl
advised"), and the device is permanently locked up.  I've found that usually,
a "kill -HUP", "kill -INT", or "kill -QUIT" will actually get rid of a
suspect kermit process, but a "kill -TERM" almost always hangs it (until the
next reboot).

Is this a problem in the device driver, kernel process management, or
something else entirely?  It seems that if an event is being waited for,
there ought to be a way to have the OS force the event to occur or fail.

Patrick Wolfe  (pat@kai.com, kailand!pat)
System Manager, Kuck and Associates, Inc.

guy@auspex.UUCP (Guy Harris) (12/17/88)

>Now (besides the obvious, cntl/C to the program), what do I
>do in the future when we have such a hung process?

Reboot.  I suspect the problem may be that an interrupt was lost, or
something like that, so the drive will appear to be *permanently* in a
non-usable state until you reboot, and the process blocked waiting for
the drive to finish whatever it was doing will block there forever;
since it's waiting at a priority less than PZERO, you can't get rid of
it with a signal.  Several different kinds of devices exhibit this
problem in various flavors of UNIX.

In this particular case, there is arguably a bug in the tape driver; I
don't know how easy it is to fix.
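
A toy illustration of the PZERO point, in C.  This is a user-space
model and not kernel source: toy_proc and toy_sleep_check are invented
names, and the priority values are only meant to resemble the classic
BSD ones.  It shows why a pending signal, even a kill -9, does nothing
for a process sleeping at or below PZERO.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values only; real kernels define their own. */
enum { TOY_PRIBIO = 20, TOY_PZERO = 25, TOY_TTIPRI = 28 };

struct toy_proc {
    bool signal_pending;   /* e.g. a kill -9 has been posted */
    bool event_arrived;    /* the driver's interrupt handler did its wakeup */
};

enum toy_result { STILL_ASLEEP, WOKEN_BY_EVENT, WOKEN_BY_SIGNAL };

/* One pass of the decision made for a process sleeping at priority pri. */
enum toy_result toy_sleep_check(const struct toy_proc *p, int pri)
{
    assert(p);
    if (p->event_arrived)
        return WOKEN_BY_EVENT;
    /* Signals are only considered for sleeps above PZERO. */
    if (pri > TOY_PZERO && p->signal_pending)
        return WOKEN_BY_SIGNAL;
    return STILL_ASLEEP;
}
```

In this model a tape read sleeping at TOY_PRIBIO stays asleep no matter
what signals are pending, while a terminal read sleeping at TOY_TTIPRI
is woken by the first one.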

debra@alice.UUCP (Paul De Bra) (12/17/88)

In article <43200057@uicsrd.csrd.uiuc.edu> kai@uicsrd.csrd.uiuc.edu writes:
>...
>This problem is not specific to Ultrix.  I've found the exact same thing
>occurs on VAX BSD unix, Sequent Dynix, and Alliant Concentrix.  The only
>thing that seems to work is a reboot.  I warned everyone here that
>interrupting a tape job by putting the drive offline is a tremendous
>mistake.
>...

I have observed exactly the same behavior with Ultrix 2.0 on a Microvax II,
where putting a TK-50 offline could only be recovered from by a reboot.

An identical machine with Unix 9Vr2 does not have this problem. So it clearly
is something in the Berkeley device driver.

Paul.
-- 
------------------------------------------------------
|debra@research.att.com   | uunet!research!debra     |
------------------------------------------------------

guy@auspex.UUCP (Guy Harris) (12/17/88)

>Is this a problem in the device driver, kernel process management, or
>something else entirely?

I consider it ultimately to be a problem in the device driver, unless
there is *no* software workaround for the problem.  Both in the
tape-drive case where an interrupt may get lost, and the serial port
case where the "close" is waiting for output to drain, but output has
been suspended and the line is sufficiently dead that it's hard to
resume it, you can probably always have some timeout in the driver, but
you don't want to get screwed if the alarm goes off too soon. 

>It seems that if an event is being waited for, there ought to be a way
>to have the OS force the event to occur or fail.

Unfortunately, in general there isn't such a way.  I think some people
may have stuck a "forced wakeup" system call into their systems
(super-user only, I hope!); do a "ps" with the appropriate flags to get
the numeric wait channel ID and then run some program that uses that ID
in such a call.

gwyn@smoke.BRL.MIL (Doug Gwyn) (12/17/88)

In article <8555@alice.UUCP> debra@alice.UUCP () writes:
>An identical machine with Unix 9Vr2 does not have this problem. So it clearly
>is something in the Berkeley device driver.

That wouldn't surprise me very much.  VAX (and PDP-11) magtape drivers
have been pretty horrible in most UNIXes I've seen, partly due to the
tendency of most controllers to generate one interrupt when a rewind
(for example) command is accepted, and another when BOT is reached.
This implies that some state must be maintained in the driver, and
you know how notoriously easy it is to get into trouble doing that.

It could be much worse -- when I was using a prerelease of VAX/VMS
(as a DEC subcontractor) ages ago, I took the TE16 off-line manually
while it was rewinding (in order to be sure it wouldn't be overwritten,
since it still had the write ring installed), and the WHOLE SYSTEM
wedged until the rewind completed.  I nearly died laughing..

ka@june.cs.washington.edu (Kenneth Almquist) (12/17/88)

Patrick Wolfe asks:
> Is this a problem in the device driver, kernel process management, or
> something else entirely?  It seems that if an event is being waited for,
> there ought to be a way to have the OS force the event to occur or fail.

The basic problem is that the device driver should be programmed to time
out after a while if it fails to receive an interrupt.  But UN*X doesn't
make this particularly easy to do, and a lot of device drivers don't
bother.  Perhaps a variant of the sleep routine with an additional
argument specifying a timeout would encourage writers of device drivers
to do things right.
				Kenneth Almquist
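
A minimal sketch of what a sleep with a timeout argument buys, written
as a deterministic user-space model in C.  The names (toy_device,
toy_tick, toy_sleep_timeout) are hypothetical and clock ticks are
simulated by a loop; a real kernel would arm a timer instead.

```c
#include <assert.h>
#include <stdbool.h>

enum wait_result { WAIT_DONE, WAIT_TIMED_OUT };

/* Hypothetical device model: a negative value means the completion
 * interrupt will never arrive (for example, the drive went offline). */
struct toy_device {
    int ticks_until_done;
};

/* Advance the device by one clock tick; true once the operation is done. */
static bool toy_tick(struct toy_device *dev)
{
    if (dev->ticks_until_done < 0)
        return false;
    if (dev->ticks_until_done > 0)
        dev->ticks_until_done--;
    return dev->ticks_until_done == 0;
}

/* Wait for completion, but give up after max_ticks instead of forever. */
enum wait_result toy_sleep_timeout(struct toy_device *dev, int max_ticks)
{
    assert(dev && max_ticks >= 0);
    for (int t = 0; t < max_ticks; t++) {
        if (toy_tick(dev))
            return WAIT_DONE;
    }
    return WAIT_TIMED_OUT;
}
```

The caller has to treat WAIT_TIMED_OUT as an I/O error and clean up.
Choosing max_ticks is the hard part: too short and a slow but healthy
operation such as a long rewind gets aborted.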

chris@mimsy.UUCP (Chris Torek) (12/18/88)

In article <8555@alice.UUCP> debra@alice.UUCP (Paul De Bra) writes:
>... Ultrix 2.0 on a Microvax II, where putting a TK-50 offline could
>only be recovered from by a reboot.
>An identical machine with Unix 9Vr2 does not have this problem. So
>it clearly is something in the Berkeley device driver.

Blame where credit is due department :-) :

The Ultrix 2.0 TK50 driver is not the same as the `Berkeley' TK50
driver.  The `Berkeley' driver was contributed by DEC and is probably a
variant of the Ultrix 1.1 or 1.2 driver.  One should be grateful to DEC
for the existence of the driver at all; but bugs in it are not the
fault of someone at Berkeley.

In article <9221@smoke.BRL.MIL> gwyn@smoke.BRL.MIL (Doug Gwyn) writes:
>That wouldn't surprise me very much.  VAX (and PDP-11) magtape drivers
>have been pretty horrible in most UNIXes I've seen, partly due to the
>tendency of most controllers to generate one interrupt when a rewind
>(for example) command is accepted, and another when BOT is reached.
>This implies that some state must be maintained in the driver, and
>you know how notoriously easy it is to get into trouble doing that.

Indeed.  In this case, the driver is usually waiting for a DONE
interrupt of some kind, and gets an OFFLINE error interrupt instead.
It should then clean up and perhaps close the tape drive; user programs
should be able to recover by closing and reopening the drive, then
positioning the tape as appropriate.
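
A rough sketch of that cleanup path in C, as a user-space model.  All
names here (toy_tape, toy_tape_intr, toy_tape_reopen) are hypothetical
and are not taken from any real tape driver.  The point is that the
interrupt routine wakes the waiter on an OFFLINE event as well as on
DONE, and hands it an error instead of leaving it asleep.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum tape_event { EV_DONE, EV_OFFLINE };
enum tape_state { ST_IDLE, ST_BUSY, ST_ERROR };

struct toy_tape {
    enum tape_state state;
    bool waiter_asleep;   /* a process is blocked on the current transfer */
    int  error;           /* 0, or an errno value handed to the waiter */
};

/* Interrupt routine: every event that ends the transfer wakes the waiter. */
void toy_tape_intr(struct toy_tape *tp, enum tape_event ev)
{
    assert(tp);
    if (tp->state != ST_BUSY)
        return;                    /* stray interrupt, nothing in progress */
    switch (ev) {
    case EV_DONE:
        tp->state = ST_IDLE;
        tp->error = 0;
        break;
    case EV_OFFLINE:
        tp->state = ST_ERROR;      /* unusable until closed and reopened */
        tp->error = EIO;
        break;
    }
    tp->waiter_asleep = false;     /* the key step: never leave it sleeping */
}

/* Models the user program closing and reopening the drive. */
void toy_tape_reopen(struct toy_tape *tp)
{
    assert(tp);
    tp->state = ST_IDLE;
    tp->error = 0;
}
```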

>It could be much worse -- when I was using a prerelease of VAX/VMS
>(as a DEC subcontractor) ages ago, I took the TE16 off-line manually
>while it was rewinding (in order to be sure it wouldn't be overwritten,
>since it still had the write ring installed), and the WHOLE SYSTEM
>wedged until the rewind completed.  I nearly died laughing..

This sort of thing is not always the fault of software.  Early
revisions of the firmware for the Emulex SC41/MS (a UDA50 emulator that
speaks SMD II or II+ or IIe or XMD or something---anyway, it plugs into
the big CDC 14 inch drives---where was I?  Oh yes, SC41/MS firmware)
had a bug that would, under high load, hang a VAX-11/750 so utterly
that only power cycling, or pushing the little white `reset' button,
would bring it back.  (All console microcode function, including ^P,
was suspended.)  I told Emulex of the bug's existence, but they
did nothing about it until VMS V4 came out and began exercising it.
I think they doubted me :-) .

At any rate, the device driver situation is likely to suddenly become
better.  Read all about it in a future Usenix conference proceedings
. . . I hope.  (Not San Diego.  Maybe Baltimore.)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

henry@utzoo.uucp (Henry Spencer) (12/18/88)

In article <43200057@uicsrd.csrd.uiuc.edu> kai@uicsrd.csrd.uiuc.edu writes:
>This problem is not specific to Ultrix.  I've found the exact same thing
>occurs on VAX BSD unix, Sequent Dynix, and Alliant Concentrix.  The only
>thing that seems to work is a reboot.  I warned everyone here that
>interrupting a tape job by putting the drive offline is a tremendous
>mistake.

Hardware permitting, it is sometimes possible to break a system out of
this sort of hang by taking the drive offline, spacing the tape forward,
initiating a rewind, and then putting the drive online before the rewind
finishes.  On at least some drives, this generates the interrupt that the
driver is waiting for (although the driver may detect enough anomalies in
how this happened to complain).  Of course, if your drive won't let you
do this without software cooperation, you're sorta stuck...

Many, many device drivers unfortunately don't observe a general rule of
robustness:  unless there is legitimate reason for a device operation to
take an unbounded time to complete (e.g. a read from a terminal), drivers
should *never* sleep waiting for a device without setting a timeout.
This applies to *all* devices, since hardware failures should be handled
more gracefully than by just hanging, but devices that can wander off into
limbo due to human intervention are particularly important cases.
-- 
"God willing, we will return." |     Henry Spencer at U of Toronto Zoology
-Eugene Cernan, the Moon, 1972 | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

jbn@glacier.STANFORD.EDU (John B. Nagle) (12/19/88)

      Lost interrupts tend to be more of a problem on systems with interrupts
as edges rather than levels.  PDP11s and VAXen are in the former category, and
Motorola M680x0 machines are in the latter.  On Motorola iron, and on the
buses usually used with it, controllers raise an interrupt line when they
want attention, and the interrupt will recur until the controller is made
happy.  Thus, interrupts tend not to be lost; even if the CPU, bus, or
driver loses one, it will recur until serviced.

      On the other hand, obscure errors in driver interrupt processing in
level-triggered interrupt systems tend to result in the system hanging
in a tight loop incorrectly servicing the interrupt.  In systems with
edge-triggered interrupts, one needs a timer to detect lost interrupts.
In systems with level-triggered interrupts, one may need a counter to
detect ones that won't clear.

					John Nagle
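
A small sketch in C of the counter idea for level-triggered systems.
It is a simulation only: toy_irq, toy_dispatch and STORM_LIMIT are
invented for illustration, and the threshold is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>

#define STORM_LIMIT 1000   /* hypothetical threshold */

struct toy_irq {
    bool line_asserted;    /* level-triggered request line */
    bool device_clears;    /* false models a device that never drops the line */
    int  consecutive;      /* services in a row that left the line asserted */
    bool masked;           /* set once the dispatcher gives up on the source */
};

/* Service routine: a healthy device drops its request when serviced. */
static void toy_service(struct toy_irq *irq)
{
    if (irq->device_clears)
        irq->line_asserted = false;
}

/* Service the line until it clears or the counter trips.
 * Returns how many times the service routine ran. */
int toy_dispatch(struct toy_irq *irq)
{
    int runs = 0;

    assert(irq);
    irq->consecutive = 0;
    while (irq->line_asserted && !irq->masked) {
        toy_service(irq);
        runs++;
        if (irq->line_asserted && ++irq->consecutive >= STORM_LIMIT)
            irq->masked = true;    /* stop looping and flag the device */
    }
    return runs;
}
```

Without the counter, the stuck case would spin in the loop forever,
which is the tight-loop hang described above.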

jfh@rpp386.Dallas.TX.US (The Beach Bum) (12/19/88)

In article <1988Dec18.023931.28730@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>Many, many device drivers unfortunately don't observe a general rule of
>robustness:  unless there is legitimate reason for a device operation to
>take an unbounded time to complete (e.g. a read from a terminal), drivers
>should *never* sleep waiting for a device without setting a timeout.

Or as a minimum, provide a means of awakening a wedged driver.  Perhaps
by using a single event and checking for error conditions or completion
prior to sleeping or continuing.

>This applies to *all* devices, since hardware failures should be handled
>more gracefully than by just hanging, but devices that can wander off into
>limbo due to human intervention are particularly important cases.

And as a suggestion - the inclusion of an ioctl to RESET the device
has proven most helpful with a certain vendor's hardware which is highly
prone to extreme flakiness.
-- 
John F. Haugh II                        +-Quote of the Week:-------------------
VoiceNet: (214) 250-3311   Data: -6272  |"Just remember, if you swap the first
InterNet: jfh@rpp386.Dallas.TX.US       | and second letters of USENET you get
UucpNet : <backbone>!killer!rpp386!jfh  +-SUENET."      -- J. F. Haugh II------

henry@utzoo.uucp (Henry Spencer) (12/20/88)

In article <17909@glacier.STANFORD.EDU> jbn@glacier.UUCP (John B. Nagle) writes:
>      Lost interrupts tend to be more of a problem on systems with interrupts
>as edges rather than levels.  PDP11s and VAXen are in the former category, and
>Motorola M680x0 machines are in the latter.  On Motorola iron, and on the
>buses usually used with it, controllers raise an interrupt line when they
>want attention, and the interrupt will recur until the controller is made
>happy...

Can you explain where you got the idea that the PDP11 (I can't speak for
the VAX) does something different?  Believe me, having an interrupt recur
until the controller is happy is a well-known nuisance on the 11, and as
far as I know (I'm not intimate with the Unibus any more), the protocol
is entirely level-triggered.

The problem being discussed here is not an interrupt being missed, but no
interrupt coming in at all -- the software is waiting for an event that
may never occur if the tape drive is meddled with by humans.
-- 
"God willing, we will return." |     Henry Spencer at U of Toronto Zoology
-Eugene Cernan, the Moon, 1972 | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

dave@micropen (David F. Carlson) (12/20/88)

In article <732@auspex.UUCP>, guy@auspex.UUCP (Guy Harris) writes:
> >Now (besides the obvious, cntl/C to the program), what do I
> >do in the future when we have such a hung process?
> 
> Reboot.  I suspect the problem may be that an interrupt was lost, or
> something like that ...
> since it's waiting at a priority less than PZERO, you can't get rid of
> it with a signal.
> 
> In this particular case, there is arguably a bug in the tape driver; I
> don't know how easy it is to fix. 

Of course, there is a bug.  And by the description, the driver is waiting
at less than PZERO too.

My suggestion (and, yes, a requirement in my own code) is that *ALL* device
drivers have a hard reset ioctl that frees every used resource and resets the
state of the hardware to a known good state.  There is no excuse for the default
action of a driver bug (yes, there will be bugs) hanging a process without
any hope of salvation.  The code is usually already written for most drivers
that have an init() routine.

Is there any reason to build a device driver without a hard reset?

-- 
David F. Carlson, Micropen, Inc.
micropen!dave@ee.rochester.edu

"The faster I go, the behinder I get." --Lewis Carroll
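
A sketch in C of what a hard reset ioctl might look like, as a
user-space model.  TOYIOC_RESET, toy_softc, toy_init and toy_ioctl are
hypothetical names, not from any real driver.  The reset entry reuses
the init routine and also wakes any blocked process with an error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define TOYIOC_RESET 1     /* hypothetical command number */

struct toy_softc {
    bool busy;             /* a transfer is outstanding */
    bool waiter_asleep;    /* a process is blocked waiting for it */
    int  pending_error;    /* error to hand to the woken process */
};

/* The same routine the driver would run at attach time. */
static void toy_init(struct toy_softc *sc)
{
    sc->busy = false;
    sc->pending_error = 0;
}

/* Returns 0 on success or an errno value, in the style of old drivers. */
int toy_ioctl(struct toy_softc *sc, int cmd)
{
    bool had_waiter;

    assert(sc);
    switch (cmd) {
    case TOYIOC_RESET:
        had_waiter = sc->waiter_asleep;
        toy_init(sc);                        /* back to a known good state */
        sc->pending_error = had_waiter ? EIO : 0;
        sc->waiter_asleep = false;           /* wake whoever was stuck */
        return 0;
    default:
        return ENOTTY;
    }
}
```

For this to help, the reset request itself must never sleep on the
wedged device, and in practice it would be restricted to the
super-user.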

ggs@ulysses.homer.nj.att.com (Griff Smith) (12/21/88)

In article <43200057@uicsrd.csrd.uiuc.edu>, kai@uicsrd.csrd.uiuc.edu writes:
> 
> > Written by jwp@larry.UUCP:
> > ...
> > so instead of interrupting the process with cntl/C, I pop the
> > drive off-line (thinking the jobs will abort).  Later, I find
> > the "readtape" job hanging around with a priority of -5, and
> > I couldn't kill it.
> 
> This problem is not specific to Ultrix.  I've found the exact same thing
> occurs on VAX BSD unix, Sequent Dynix, and Alliant Concentrix.  The only
> thing that seems to work is a reboot.

It's not just the brand of operating system; specific drives may work
better than others.  The 4.3BSD device driver for the TU78 tape drive
should have no problem with a drive going off-line; the error recovery
is implemented to follow the procedures described in DEC's TM78
documentation.  I tested the driver with all the nasty cases I could
think of, including dropping power to the tape controller while a tape
was in motion.  The driver survived on a VAX 11/785, but a power
problem on a VAX 8650 caused an interrupt loop that required a
re-boot.  You might try tripping power breakers to see what happens,
but not when you aren't willing to take a crash.

> Is this a problem in the device driver, kernel process management, or
> something else entirely?  It seems that if an event is being waited for,
> there ought to be a way to have the OS force the event to occur or fail.

The problem is that people who write tape device drivers often don't
give a damn about error recovery.  There has also been little interest
in defining consistent error recovery semantics, even WITHIN offerings
from a single vendor.  Some tape drives also make it difficult to deal
gracefully with errors.  I think 4.4BSD will have a close approximation
of reasonable behavior; I'm surprised that Ultrix doesn't yet.

> Patrick Wolfe  (pat@kai.com, kailand!pat)
> System Manager, Kuck and Associates, Inc.
-- 
Griff Smith	AT&T (Bell Laboratories), Murray Hill
Phone:		1-201-582-7736
UUCP:		{most AT&T sites}!ulysses!ggs
Internet:	ggs@ulysses.att.com

chris@mimsy.UUCP (Chris Torek) (12/21/88)

In article <11027@ulysses.homer.nj.att.com> ggs@ulysses.homer.nj.att.com
(Griff Smith) writes:
>... a power problem on a VAX 8650 caused an interrupt loop that required a
>re-boot.

We saw something similar on an 8600.  I think the SBIA gets confused.
A more general reset scheme may surface someday (if I ever finish this
paper...); it may be possible to recover even from this.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

w-colinp@microsoft.UUCP (Colin Plumb) (12/21/88)

In an OS I was doing recently, as we were doing the device interface
(calls a device should support... open, close, read, write, ioctl for
misc. stuff), I successfully argued that abort should be at this level,
to encourage people to use it, and so there is no perverse thing you
could do with ioctl that might cause it to block.  It's a special case
all the way through (since it does slightly nasty things to the transaction
in progress).
-- 
	-Colin (uunet!microsof!w-colinp)

dave@onfcanim.UUCP (Dave Martindale) (12/21/88)

In article <1988Dec19.215505.3768@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>>On Motorola iron, and on the
>>buses usually used with it, controllers raise an interrupt line when they
>>want attention, and the interrupt will recur until the controller is made
>>happy...
>
>Can you explain where you got the idea that the PDP11 (I can't speak for
>the VAX) does something different?  Believe me, having an interrupt recur
>until the controller is happy is a well-known nuisance on the 11, and as
>far as I know (I'm not intimate with the Unibus any more), the protocol
>is entirely level-triggered.

The behaviour depends on the interrupt controller in the device.
The Unibus itself just handles requests and grants, and a device is free
to request over and over again until it's happy.  However, the original
DEC interrupt controller card contained a flip-flop that was triggered
by the rising edge of an interrupt request signal (usually DONE anded with
interrupt enable) and cleared when the interrupt had been granted by
the Unibus.  Later DEC devices, with all the circuitry on a single card,
retained this behaviour.

Thus, it is common to have a Unibus device driver just handle the information
passed back from the device by an interrupt without ever doing anything
to change the state of the device.  The DONE or READY bit and IENABLE bits
remain set, and the software knows that the hardware will not request
another interrupt.
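
The flip-flop behaviour described above can be modelled in a few lines
of C.  This is a logic-level toy, not a description of any particular
DEC board; toy_intr_ff, toy_sample and toy_grant are invented names.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_intr_ff {
    bool prev_level;   /* last sampled value of DONE and-ed with IENABLE */
    bool request;      /* flip-flop output: an interrupt request is pending */
};

/* Sample the device status; only a rising edge sets the request. */
void toy_sample(struct toy_intr_ff *ff, bool done, bool ienable)
{
    bool level = done && ienable;

    assert(ff);
    if (level && !ff->prev_level)
        ff->request = true;
    ff->prev_level = level;
}

/* The bus grants the interrupt: the request is cleared. */
void toy_grant(struct toy_intr_ff *ff)
{
    assert(ff);
    ff->request = false;
}
```

Once the request has been granted, leaving DONE and IENABLE set does
not raise it again; a new request needs a fresh rising edge.  That is
why such a driver can return without touching the device state.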

guy@auspex.UUCP (Guy Harris) (12/23/88)

>Of course, there is a bug.

Note, though, that the bug may be the lack of a timeout for some
operations, rather than something the driver explicitly does to lose an
interrupt; there may well be hardware botches that permanently lose
completion interrupts.

rbj@nav.icst.nbs.gov (Root Boy Jim) (01/07/89)

? In article <9221@smoke.BRL.MIL> gwyn@smoke.BRL.MIL (Doug Gwyn) writes:
? >This implies that some state must be maintained in the driver, and
? >you know how notoriously easy it is to get into trouble doing that.

I heard that Sun is coming out with NTS, a stateless Network Tape System.
Most likely AT&T will come out with RTS which preserves UNIX semantics.

	(Root Boy) Jim Cottrell	(301) 975-5688
	<rbj@nav.icst.nbs.gov> or <rbj@icst-cmr.arpa>
	Crackers and Worms -- Breakfast of Champions!