[net.bugs.4bsd] Possible uda50 driver problem

steve@gec-mi-at.co.uk (Steve Lademann) (08/22/86)

Subject: Problem with RIACS uda50 driver
Newsgroups: net.unix,net.bugs.4bsd

HELP!

We are having problems with a VAX 11/750 running 4.2BSD. It all started
when the air-flow sensor got bunged up with a ball of fluff, thus
precipitating several removals of power from the system. Now, it panic
traps with Memory Protection violations (trap type 9) from the uda50
driver in routine 'udrsp' at infrequent intervals, causing considerable
consternation among the 15-odd users!

Now, it is probably a hardware fault, but it could be a problem with
the uda50 driver. I will probably need to get stuck in with reams of
assembler/C listings if the engineer can't fix it, but to save time,
does anyone out there know of any problems with the May 1985 version of
the RIACS driver containing Mike Muuss' unit/type detection code, and
the bad block replacement software?

Many thanks in anticipation



-----------------------------------------------------------------
|Steve Lademann         |Phone: 44 727 59292 x326               |
|Marconi Instruments Ltd|UUCP : ...mcvax!ukc!hrc63!miduet!steve |
|St. Albans    AL4 0JN  |NRS  : steve@uk.co.gec-mi-at           |
|Herts.   UK            |                                       |
-----------------------------------------------------------------
|"The views expressed herein do not necessarily reflect"| _____ |
|"those of my employer, and may not even reflect my own"| (   ) |
-----------------------------------------------------------------

chris@umcp-cs.UUCP (Chris Torek) (08/27/86)

In article <189@miduet.gec-mi-at.co.uk> steve@gec-mi-at.co.uk
(Steve Lademann) writes:

>... It all started when the air-flow sensor got bunged up with a
>ball of fluff, thus precipitating several removals of power from
>the system. Now, it panic traps ... from the uda50 driver in routine
>'udrsp' at infrequent intervals....

Strict-typists take heed: lint can create bugs as well as detect
them!  :-)

>Now, it is probably a hardware fault ... to save time, does anyone
>out there know of any problems with the May 1985 version of the RIACS
>driver containing Mike Muuss' unit/type detection code, and the bad
>block replacement software?

Well, yes, but I know of none in udrsp().  Incidentally, I believe
the unit type detection code is by Alex White (watmath!arwhite).

Obligatory Disclaimer:  The following information has not been
officially verified by anyone.  It is all based upon guesswork on
my part, or reverse engineering, if you will.

In particular, this driver---along with every other released UDA50
driver---can occasionally invoke a UDA50 microcode bug, in which
the drives go off line, and the controller stops responding and
must be re-set.  There may be many ways to provoke the problem,
but the primary path concerns rapid-fire Get Unit Status operations.
The RIACS driver, and I think the 4.3BSD driver, try to avoid this,
but can miss if there are other devices on the same Unibus that
use BDPs.  DEC `fixed' the problem in microcode revision 4 ... on
780s and slower machines only, as a CPU conversion later proved.

I have also been told that the bad block forwarding algorithm in
the RIACS driver is based upon, or related to, a flawed algorithm
that was once used in VMS.  I have not looked further into this;
since the last of our RA81 HDAs were upgraded past the `falling
glue' revisions, we have had no real trouble with the HDAs themselves.
The UDA50s, on the other hand ... well, even Emulex has bugs in
their UDA50 emulator (and Emulex finally admitted this: a major
milestone in itself).

It may be a bit early for this, but here goes:  I plan to distribute
a completely rewritten 4.3BSD UDA50/MSCP driver.  As far as I know,
it has no bugs (but then, I said that last week!).  It does not do
dynamic bad block replacement; I now believe that this should be
done outside the driver, at least in the main.  In any case, I have
had no bad blocks to replace, and so I have neither incentive to
write, nor a way to test, such code.  The hooks, however, are there,
in the generic MSCP part of the driver.  Generic: for I have split
the driver into a UDA50-specific portion, and another piece that
deals only with the MSC protocol itself.  I hope that this may be
useful in reducing the size of the TK50/TU81/TA81 driver (a whopping
fifty-seven thousand bytes of C code in the 4.3 distribution).  I
have not looked beyond its copyright notice at the top, as will
no doubt become painfully obvious to anyone who actually tries to
merge my code with DEC's: but the potential is there.

Well, this has become rather long, so perhaps I should say this,
to legitimise the length after the fact (as it were).  If you are
running 4.3BSD, and would like to beta test my driver, let me know.
It will help if you have ARPAnet access, or some other way to
transfer large amounts of data quickly: the code currently totals
some 85K; and that includes neither instructions nor the changes
to the Unibus allocation code.  uda.c alone is 47820 bytes, though
that drops to 28520 bytes when comments are stripped!
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 1516)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@mimsy.umd.edu

steve@gec-mi-at.co.uk (Steve Lademann) (09/01/86)

Chris,
	Many thanks for input on my uda50 problems. The state at
present is that my annual holiday just happened to coincide with
a visit of our service engineer with a replacement memory controller
and the re-instatement of the original 4.2BSD driver. The problem has not
recurred, so one or the other fixed it. The next problem is to find
sufficient time to isolate the actual cause....

I would be most willing to act as a beta site for your driver, apart
from 2 problems which I believe to be common in this lesser part of
the world called Europe. These are :-

1)	4.3 BSD is not available yet in Europe due to legal problems
	somewhere in U.S. governmental bureaucracy or in University of
	California. (Actual reasons for non-availability seem to vary
	depending on who you speak to - does anyone *really* know?)

2)	I find that large volumes of data are extremely tricky to get
	across the Pond when you are a commercial organisation,
	especially from academic sources. Is this because I don't know
	the right people? (Comments, UK contacts, etc. *very* welcome,
	especially Arpanet, or whatever it's called nowadays :-)
	I tend to resort to mag. tapes and things in the end. E-mail
	is restricted to 1200 baud PSS or 300 baud dial-up.

With regard to bad block replacement, yes, I would agree that doing
the replacement in a separate process makes a lot of sense. It
hopefully reduces the complexity (and hence potential bugginess)
of the driver. However, from what I've seen with regard to the bad block
replacement algorithms, there is a lot of Black Magic involved in the
process anyway. I went through a period of rapid change with regard to HDAs
on our RA81 before they found one which had non-destructive glue (yes, they
put one of the old ones in which had to be changed *again* before it got
its act together!) and this prompted me to do something about the driver.
What will your solution to the problem of getting rid of bad blocks be
(if any)? I must admit that after your comments with regard to the 'iffy'
algorithm in the RIACS driver, mine could well be to call Field Service!

chris@umcp-cs.UUCP (Chris Torek) (09/08/86)

[Some of this is being taken off line.]

In article <194@miduet.gec-mi-at.co.uk> steve@gec-mi-at.co.uk
(Steve Lademann) writes:
>With regard to bad block replacement [on UDA50s] ... from what
>I've seen with regard to the bad block replacement algorithms,
>there is a lot of Black Magic involved in the process anyway.

Not really.  The idea is very straightforward: save the original
data, allocate a replacement block, test it, copy the original data
to it, and mark the original block as forwarded.  The ugliness is
all caused by state saving for error recovery---our old friend the
commit. . . .  The problem gets worse if the forwarding code is
outside the driver.  What happens if the machine crashes during
forwarding?  (If the driver does things with the block half-forwarded,
who knows what might happen.)
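
The steps above, and the state saving that makes them ugly, can be
sketched in C. This is a toy in-memory model, not code from any UDA50
driver: the names (toy_disk, forward_block, forward_resume), the
geometry, and the idea of a one-entry journal are all assumptions made
for illustration. Each journal update is assumed to be atomic, which a
real implementation would have to arrange on stable storage.

```c
#include <assert.h>
#include <string.h>

#define NBLOCKS 16   /* user-visible blocks (hypothetical toy geometry) */
#define NSPARE  4    /* replacement blocks held in reserve */
#define BLKSIZE 512

/* How far a forwarding operation got; recorded so it can be resumed. */
enum fwd_state { FWD_IDLE, FWD_SAVED, FWD_ALLOCATED, FWD_COPIED };

struct toy_disk {
    unsigned char data[NBLOCKS + NSPARE][BLKSIZE];
    int forward[NBLOCKS];          /* -1, or index of replacement block */
    int spare_used[NSPARE];
    enum fwd_state state;          /* journal: current step */
    int j_bad, j_repl;             /* journal: block being forwarded */
    unsigned char j_save[BLKSIZE]; /* journal: saved original data */
};

void disk_init(struct toy_disk *d)
{
    int i;
    memset(d, 0, sizeof *d);
    for (i = 0; i < NBLOCKS; i++)
        d->forward[i] = -1;
    d->state = FWD_IDLE;
}

/* Map a logical block to the place its data currently lives. */
int disk_resolve(const struct toy_disk *d, int blk)
{
    assert(blk >= 0 && blk < NBLOCKS);
    return d->forward[blk] >= 0 ? d->forward[blk] : blk;
}

/* Write a pattern and read it back; in this toy it always succeeds. */
static int test_block(struct toy_disk *d, int phys)
{
    unsigned char pat[BLKSIZE];
    memset(pat, 0xA5, sizeof pat);
    memcpy(d->data[phys], pat, BLKSIZE);
    return memcmp(d->data[phys], pat, BLKSIZE) == 0;
}

/* Continue from whatever step the journal records (e.g. after a
 * crash).  Returns 0 on success, -1 if no usable spare was found. */
int forward_resume(struct toy_disk *d)
{
    int i;
    if (d->state == FWD_SAVED) {           /* allocate and test */
        for (i = 0; i < NSPARE; i++)
            if (!d->spare_used[i] && test_block(d, NBLOCKS + i))
                break;
        if (i == NSPARE) {
            d->state = FWD_IDLE;           /* give up; nothing changed */
            return -1;
        }
        d->spare_used[i] = 1;
        d->j_repl = NBLOCKS + i;
        d->state = FWD_ALLOCATED;
    }
    if (d->state == FWD_ALLOCATED) {       /* copy the saved data */
        memcpy(d->data[d->j_repl], d->j_save, BLKSIZE);
        d->state = FWD_COPIED;
    }
    if (d->state == FWD_COPIED) {          /* the commit point */
        d->forward[d->j_bad] = d->j_repl;
        d->state = FWD_IDLE;
    }
    return 0;
}

/* Start forwarding a bad block: save its data, then run the steps. */
int forward_block(struct toy_disk *d, int bad)
{
    assert(bad >= 0 && bad < NBLOCKS);
    assert(d->state == FWD_IDLE && d->forward[bad] < 0);
    memcpy(d->j_save, d->data[bad], BLKSIZE);
    d->j_bad = bad;
    d->state = FWD_SAVED;
    return forward_resume(d);
}
```

A caller would run forward_resume() once at start-up to finish any
half-done forwarding before using the disk, then call forward_block()
for each block reported bad. Until the commit step sets forward[],
disk_resolve() still returns the original block, so a reader never
sees a half-forwarded mapping in this model.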

>What will your solution to the problem of getting rid of bad blocks be
>(if any)? I must admit that after your comments with regard to the 'iffy'
>algorithm in the RIACS driver, mine could well be to call Field Service!

Well, actually . . . yes.  At least for now.

If you have a copy of Ultrix 1.1 or later, you can run `rabads';
it is a standalone program that just forwards bad sectors.  I have
never used it myself.