[comp.sys.dec] bug report etiquette

ggs@ulysses.homer.nj.att.com (Griff Smith) (02/24/88)

In article <2338@umd5.umd.edu>, chris@trantor.umd.edu.UUCP writes:
> >In article <2323@umd5.umd.edu> I asked:
> >>Is the 4.3BSD [TU78] driver wrong once or twice?
> 
> It appears I insulted the driver unnecessarily.  I have yet to find
> a bug here, but...
> 
> In article <10102@ulysses.homer.nj.att.com> ggs@ulysses.homer.nj.att.com
> (Griff Smith) writes:
> >The driver, as released in 4.3BSD, does have a few bugs - not in error
> >recovery that I know of.
> 
> ... I spotted a nice bug in mtustart...
> -- 
> In-Real-Life: Chris Torek, Univ of MD Computer Science, +1 301 454 7163

If you had looked in your archives of comp.bugs.4bsd you would have
found the report of this "nice bug"; I sent it to Berkeley on August
11, 1987 and also filed to netnews.  That bug was introduced by someone
at Berkeley sometime between 4.3 beta and 4.3 official.  I also
reported a race in the "open" code - I will take responsibility for not
having noticed it when I overhauled the driver.

I don't want to start a war, but I am a bit miffed about this exchange
of notes about the TU78 driver.  I tried to do a careful job of quality
testing, added comments to the code to explain my assumptions, went
through a year's grief getting approval from my management to release
the code, and then followed up with more bug reports when further
problems (some beyond my control) appeared after the official release.
I also left a mail address in the source code in case problems were
discovered.

It would have been an act of simple courtesy to have asked me in
private communication first.  Not only would it have saved net
bandwidth, but the problem would probably have been diagnosed faster
and both of us would have avoided the embarrassment of a public
shouting match.  I particularly resent having my sentence "I posted
fixes to comp.bugs.4bsd for the bugs that I found" removed from the
above followup to my followup.  This twisted the reply to say "your
driver does too have bugs - look at this juicy one".  I don't think I
should have felt obliged to re-post the bug reports just to preempt this
kind of jab.

I will try to continue to follow a policy of using private mail when
questioning network articles.  After an embarrassing exchange with
Chris a few years ago, I think I have learned my lesson about public
posting.  It can be frustrating, however.  A recent private exchange
with Chris about difficulties taking his advice to port 4.3BSD "dump"
to Sun work stations broke off with a comment that could be paraphrased
as "this is left as a trivial exercise to the reader".

Chris, I respect your ability.  You have made valuable contributions to
the UNIX System software environment.  How about giving the rest of us
mortals some credit for intelligence.
-- 
Griff Smith	AT&T (Bell Laboratories), Murray Hill
Phone:		1-201-582-7736
UUCP:		{allegra|ihnp4}!ulysses!ggs
Internet:	ggs@ulysses.att.com

chris@trantor.umd.edu (Chris Torek) (02/24/88)

In article <10110@ulysses.homer.nj.att.com> ggs@ulysses.homer.nj.att.com
(Griff Smith) writes:
>If [I] had looked in your archives of comp.bugs.4bsd you would have
>found the report of this "nice bug"; I sent it to Berkeley on August
>11, 1987 and also filed to netnews.

Unfortunately, my archives (such as they are) are on those tapes we
have been having trouble reading.  You did indeed post such a fix;
I found it elsewhere since.  And as you mention, that bug was introduced
at Berkeley by A. Nonymous anyway.

[much deleted; see the previous article]

>It would have been an act of simple courtesy to have asked me in
>private communication first.

(That presupposes I would know where to ask.)

>...  I particularly resent having my sentence "I posted
>fixes to comp.bugs.4bsd for the bugs that I found" removed from the
>above followup to my followup.  This twisted the reply to say "your
>driver does too have bugs - look at this juicy one".

I did not mean to do that.  (For that matter, I know of no one who
uses cooked /dev/mt devices anyway.  Without a way to set the block
size, and given the repositioning error on 9 track tapes, what good
*are* block tape devices?  They make terrible disk drives.  Hence a
bug in the block code is hardly juicy.)

At any rate, to get to the point (yes, there is one here), I had
actually intended my previous followup as an apology.  I just like
having the last word :-) and was not careful about the wording of
said words.  Consider this another attempt.

By the way, we have concluded that the problem is in hardware.  A
handy nearby 6250 bpi drive is now successfully reading those tapes,
and we have a call in to DEC to get the TU78 fixed.
-- 
In-Real-Life: Chris Torek, Univ of MD Computer Science, +1 301 454 7163
(hiding out on trantor.umd.edu until mimsy is reassembled in its new home)
Domain: chris@mimsy.umd.edu		Path: not easily reachable

nessus@athena.mit.edu (Doug Alan) (02/27/88)

In article <2346@umd5.umd.edu> chris@trantor.umd.edu (Chris Torek) writes:

> (For that matter, I know of no one who uses cooked /dev/mt devices
> anyway.  Without a way to set the block size, and given the
> repositioning error on 9 track tapes, what good *are* block tape
> devices?  They make terrible disk drives.  Hence a bug in the block
> code is hardly juicy.)

I've used block tape devices a lot.  We have many DEC TK50 streaming
tape drives here (one came with every one of a couple hundred VS2's we
received).  The TK50 performs very, very slowly and unreliably if it
doesn't get to stream.  The block device is double buffered, while the
raw device is not.  If the raw device is used with the TK50 drive, the
tape drive doesn't stream.  If the block device is used with the TK50
drive, the tape drive does stream, and is much much happier.
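
A rough way to see why the second buffer matters is a toy timing model.
This is only a sketch under stated assumptions: the function names, the
constant per-block times, and the "max_gap" streaming threshold are all
hypothetical and are not taken from any driver or from TK50 documentation.

```python
def drive_idle_time(fill_time, write_time, buffers):
    """Seconds the drive waits for data between consecutive blocks.

    fill_time  -- time the host needs to prepare one block (assumed constant)
    write_time -- time the drive needs to write one block (assumed constant)
    buffers    -- 1 models an unbuffered raw write, 2 models double buffering
    """
    if buffers < 1:
        raise ValueError("need at least one buffer")
    if buffers == 1:
        # The host cannot start preparing the next block until the
        # previous write completes, so the drive idles for the whole fill.
        return fill_time
    # With a second buffer the host fills one while the drive empties the
    # other; the drive waits only if filling is slower than writing.
    return max(0.0, fill_time - write_time)


def streams(fill_time, write_time, buffers, max_gap):
    """True if the idle gap is short enough for the drive to keep moving.

    max_gap is a hypothetical threshold, not a published drive parameter.
    """
    return drive_idle_time(fill_time, write_time, buffers) <= max_gap
```

With made-up figures of 5 ms to fill a block, 8 ms to write it, and a
1 ms tolerance, the single-buffer case leaves the drive idle for the
full 5 ms and the model says it stops, while the double-buffered case
has no idle gap at all.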

|>oug /\lan

jbs@eddie.MIT.EDU (Jeff Siegal) (02/28/88)

In article <3261@bloom-beacon.MIT.EDU> nessus@athena.mit.edu (Doug Alan) writes:
>The TK50 performs very very very slow and unreliably if it
>doesn't get to stream.  The block device is double buffered, while the
>raw device is not.  If the raw device is used [...], the
>tape drive doesn't stream.  If the block device is used [...],
>the tape drive does stream, [...].

In addition to the buffering, the block device forces an abysmally
small block size (as Chris pointed out).  This is a conventional way
to make streaming tape drives stream (by reducing the tape data rate
and density), and also a great way to cripple a tape subsystem.  A
much better way to drive such devices is with raw, asynchronous I/O.
Oh, Unix doesn't do that?  Hmm, I thought there was this other
operating system for VAX's, but I can't seem to remember the name
right now... 

Jeff Siegal

mangler@cit-vax.Caltech.Edu (Don Speck) (03/10/88)

In article <3261@bloom-beacon.MIT.EDU>, nessus@athena.mit.edu (Doug Alan) writes:
>			      If the block device is used with the TK50
> drive, the tape drive does stream, and is much much happier.

Somebody here made a similar observation about TU80's, so he did all
his dumps to the block device.  Sometime later he needed to do a restore,
and all his tapes gave a premature EOF.  Dump had been calculating tape
capacity based on 10240 byte blocks, but the block device was writing
2048 byte blocks and it wouldn't all fit.  The tape driver returned an
error, but because block-device writes are asynchronous, the completion
status doesn't get returned to anybody, so he had no indication that
writes were not getting done (except perhaps the long pause at end of
tape).

The block device is for mounting filesystems.  Read only.  Which you'd
probably only want to do if your tape drive is actually a WORM.  Didn't
work correctly in 4.2bsd, though.

Don Speck   speck@vlsi.caltech.edu  {amdahl,ames!elroy}!cit-vax!speck
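
The capacity mismatch described above comes from inter-record gaps:
every record written to tape costs a fixed gap, so the same data in
smaller records needs more tape.  The sketch below is only arithmetic
under assumptions.  The function name is hypothetical, and the density
and gap figures in the usage note are illustrative values rather than
numbers from the thread or from 'dump' itself.

```python
def tape_inches(total_bytes, block_size, density_bpi, gap_inches):
    """Tape length consumed by total_bytes written as fixed-size records.

    density_bpi -- bytes recorded per inch of tape (assumed uniform)
    gap_inches  -- blank tape left after each record (assumed constant)
    """
    if block_size <= 0 or density_bpi <= 0:
        raise ValueError("block_size and density_bpi must be positive")
    # Round up: a final partial record still costs one full gap.
    records = -(-total_bytes // block_size)
    data_inches = total_bytes / density_bpi
    return data_inches + records * gap_inches
```

Assuming 1600 bytes per inch and a 0.6 inch gap, 10240 bytes written as
one record take about 7.0 inches, while the same bytes written as five
2048 byte records take about 9.4 inches.  A program that budgets tape
for the larger record size will therefore run off the end of the reel
early if the device silently writes the smaller one.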

nessus@athena.mit.edu (Doug Alan) (03/11/88)

In article <5719@cit-vax.Caltech.Edu> mangler@cit-vax.Caltech.Edu (Don
Speck) writes:

>> [Doug Alan:] If the block device is used with the TK50 drive, the
>> tape drive does stream, and is much much happier.

> Somebody here made a similar observation about TU80's, so he did all
> his dumps to the block device.  [...] Dump [calculated] tape
> capacity based on 10240 byte blocks, but the block device was
> writing 2048 byte blocks and it wouldn't all fit.

Yup, that will happen if you don't know what you are doing.

> [He didn't notice this until later, however, when] he needed to do a
> restore, and all his tapes gave a premature EOF.  [...]  The tape
> driver [had] returned error [when writing the tape], but because
> block-device writes are asynchronous, the completion status doesn't
> get returned to anybody, so he had no indication that writes were
> not getting done (except perhaps the long pause at end of tape).

Well, I don't know what kind of system you are using, but on our
4.3BSD systems, there is no such problem.  'Dump' receives errors such
as these even when writing to the tape asynchronously using the block
device.  I know this for a fact because this very thing happened to me
last night when I made a typo and used the block device when I had
meant to use the raw device on a TU78.  A while later, 'dump' stopped,
complaining that there was a write error 1200 feet into the tape.  The
2400 foot tape on the tape drive, however, was at the end.

There *are* also a few problems using 'restore' on the block device,
but they can also be worked around.  If 'restore' gets an error while
reading the block device, it can't recover from the error and it just
gives up.  What you have to do is use 'dd' to read the tape, telling
it not to stop on errors and to pad any incomplete blocks.  The output
from 'dd', you pipe into 'restore'.
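
The 'dd' behaviour relied on here (presumably its conv=noerror,sync
conversions; the exact flags are not named above) can be sketched as
follows.  This is an illustration of the idea rather than a description
of how 'dd' is implemented: the function name and the max_errors cutoff
are hypothetical, and it assumes that a failed read still advances past
the bad record.

```python
def copy_padded(src, dst, block_size, max_errors=10):
    """Copy src to dst in fixed-size blocks, padding short or failed reads.

    src must provide read(n) and dst must provide write(data).
    Returns the number of reads that failed.
    """
    errors = 0
    while True:
        try:
            block = src.read(block_size)
        except OSError:
            errors += 1
            if errors > max_errors:
                raise  # give up rather than loop forever on a dead device
            # Stand in for the unreadable block with zeros and keep going.
            block = b"\0" * block_size
        else:
            if not block:
                break  # end of input
            # Pad a short block out to the full size.
            block = block.ljust(block_size, b"\0")
        dst.write(block)
    return errors
```

The point of the padding is that the reader downstream always sees
whole blocks at the offsets it expects, so one bad record costs one
block of data instead of the rest of the tape.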

> The block device is for mounting filesystems.  Read only.  Which you'd
> probably only want to do if your tape drive is actually a WORM.  Didn't
> work correctly in 4.2bsd, though.

So you're saying that instead of using the block device to do dumps on
the TK50 and getting dumps that worked, I should have used the raw
device and gotten dumps that didn't work?  (Using the raw device with
the TK50 results in an order of magnitude increase in time and several
orders of magnitude increase in error-rate.)  Please explain the logic
in that.

|>oug /\lan