[net.unix-wizards] UDA50/RA81 problems....

chris@umcp-cs.UUCP (08/12/84)

My understanding was that the ``random offline'' problem was due to
a timing bug in the UDA50 microcode.  If you rave at DEC long enough
they will probably swap ROMs (or even boards) for you.  Or, you could
kludge around with watchdog timers and UBA resets and the UDA init
code and try to force it back on every time it goes off line.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci (301) 454-7690
UUCP:	{seismo,allegra,brl-bmd}!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@maryland

dave@RIACS.ARPA (08/13/84)

From:  "David L. Gehrt" <dave@RIACS.ARPA>

The distributed berkeley drivers I have seen are buggy.  We noticed
poor[er than was reasonable] throughput on our 81's, and another site
here at ames was having a serious random offline problem.  We took a
dynamic look at some of the data structures in the driver, and
discovered that under heavy load, the controller was being flooded with
Get Unit Status (M_OP_GTUNT) commands.  If you look into the driver you
will see a block of code in udstart() which begins with an "if ((i =
ubasetup(..." and ends by sending a Get Unit Status command.  The
effect of this block of code is the flooding to which I referred above.
Removing this behavior clears up a lot of the problems if not all, but
we made enough changes that the context diffs are about the same size
as the driver.

The driver we are currently running works just fine, and has dumpcode
(which the distributed code lacked). We haven't gotten around to adding
support for more than one device type at a time so our ra60 is not yet
installed. The other site here started running the driver and its
serious random offline problem went away.  There are a number of sites
which have picked up the code for our driver and none have reported
back any problems as of this writing.  Neither of our sites had any
microcode upgrades but the legend is that early versions of the microcode
caused all sorts of problems. We have seen a number of modified drivers
all of which look like they would solve the problem.

We have had plans to add bad block forwarding to our driver for six
months, and have received some code which will advance that effort.
I'll report any successes in this location.  The problem with the
effort has been lack of time and lack of a source for reliable information
in support of the activity, which brings me to a...

Minor Flame:  After all the time in the field with this hardware (we
have had our ra81s for almost a year), I am more than a little
disappointed at the small amount of reliable information on the
uda50/ra?? combination in the hands of the DEC field service folks and
the users, and with the large amount of misinformation and legend we
all seem to be given. Here are a couple of legends I think are or were
wide spread and false:

	1.  "UN*X (TM) scribbles all over the rct (replacement and caching
	table) used for bad block forwarding."  [Not in *any* UN*X
	driver I have seen.]
	
	2.  "The controller forwards bad blocks automatically."  [I have
	seen nothing that indicates that this is true, and lots of bad 
	block reports to indicate that the controller is not forwarding
	them.  In VMS, for example, the host seems to initiate all
	bad block forwarding].

Flame off.  

Because the devices are new and, except for a couple of little
problems, have been reliable, and quick, the fact that the bsd
distributed drivers I have seen are not correct is very troublesome.
Also, it is beginning to look like the users of these devices need to
establish their own communications path to disseminate information on
the devices and their drivers.  DEC has a clear interest in not
disclosing too much about the protocol used and other technical details
to keep out the competition, but judging from the number of pieces of
mail here which start with some variation on the theme "Help with
UDA50/RA81 problems!!!" it is clear that there is a need to improve the
information flow. So here is a start.  I have a 4.2 driver which works
fine.  [There is no way I know of to determine if it is completely
correct, or the most efficient implementation.]  Also, I know of a site
with a working 4.1 driver, and I will try to get a copy of the diffs
for redistribution if there is sufficient interest.

I now relinquish the soap box, but I do feel better.

dave
----------

rbbb@RICE.ARPA (08/15/84)

From:  David Chase <rbbb@RICE.ARPA>

To remove some of the mystery (not all):

1) UDA "microcode" bugs:
Check your UDA boards - if they are M7161 and M7162, then they are OLD;
if they are M7485 and M7486, then they are NEW.  I don't think there are
many old boards out there anymore, since DEC (at least in our part of the
world) went around upgrading the disks on some sort of schedule.  We may
have unusually responsible field service out here, since everyone else tells
horror stories.  Whatever version of the driver we are running (for 4.2)
doesn't knock the disk offline; uda.c claims it is revision 2.1 84/03/05,
and has the unfortunate comment "TO DO: write the bad block forwarding code".

2) Information about these devices can be had from DEC; here are the order
numbers and the address:

EK-UDA50-UG-002 UDA50 User Guide (mostly hardware info)
AA-L619A-TK     MSCP Basic Disk Functions Manual
AA-L620A-TK     Storage System Diagnostic and Utilities Protocol
AA-L621A-TK     Storage System UNIBUS Port Description

I have the first manual, but not the other three.  The last three may be
ordered as a kit,

QP905-GZ        UDA50 Programmer's Documentation Kit.

The address is:

Software Distribution Center
Order Administration/Processing
20 Forbes Road (NR4)
Northboro, MA 01532

3) Deuna information (lots of it) EK-DEUNA-UG-001 Deuna User's Guide.  Why
anyone would use a Deuna when Interlan boards are available is beyond me,
since the Deuna draws about twice as much of everything from the Unibus, and
prefers the official DEC H4000 transceiver.  Xerox makes one about as big as
my fist that seems to work with the Deuna and its diagnostics, except that
it lacks the H4000's bogus "heartbeat" (the transceiver asserts "collision"
in a special window to let the controller know that its collision detector
is still working.)

For high-density ethernet applications, I recommend DEC's DELNI.  It
provides 8 connections for a single network tap.  It can also operate
without any ethernet connection (providing a cheap 8 node pseudo-ethernet)
and (if not connected to ethernet) can be tiered to support up to 64 nodes.
Cable length restrictions would probably make a 64 node DELNI network a
little silly, but it is possible.  We have 5 diskless Suns connected to a
net through one of these, and have had no trouble from the DELNI.  I also
recommend this because we have had significant (more than once) problems
with bad connections to the ethernet cable itself (sometimes shorting the
cable), and people using the network get unhappy.

4) 750 hardware information (this might solve some of the WCS questions,
though not how to deal with the DEC-supplied updates), EK-KA750-TD-002 (not
necessarily the latest edition).  This is NOT for the faint of heart.

Now, does anyone out there know any good rumors about "fast fork" for 4.2+n?
This uses copy-on-write shared memory; we once heard that this would require
a microcode update and was thus delayed.  I didn't understand that rumor,
since it seems doable with software.  Any comments?

5) There is a TM78 (the TU78/TA78 formatter) microcode upgrade floating
around; it doesn't break the 4.2 driver (it changed EOT processing in some
way, I think to report EOT before any io errors; this helps VMS backup not
embarrass itself by running off the end of the tape).  We also received this
upgrade on some schedule, I think determined by our drive serial number.

Hope this clears up some of the hardware confusion out there.

drc

andrew@hwcs.UUCP (Andrew Stewart) (08/16/84)

We are running a VAX-11/750 under 4.2BSD with an RA81 driven from a UDA50;
like a number of other sites, we have experienced the ra81/uda50 going
offline for no apparent reason. Does *anyone* know what causes this?
Is there a cure?
Is it a problem in the uda50? The ra81? The driver software?
Is it (as I suspect) a UBA timing window problem?
Any pointers or ideas would be welcomed. I will, as usual, summarise.

Andrew Stewart,
Dept. of Computer Science, Heriot-Watt University, Edinburgh.

eric@milo.UUCP (08/16/84)

	I would like to thank dave for clearing up a mystery
that has plagued me for some time. We have three 11/780s, all with
RA81s. After 6 months with no problems, one of them decided to go
berserk, occasionally going offline, etc. The other two continued
to perform flawlessly. Sounds like hardware, right? DEC replaced all
the controller boards, all the drives, the memory, and most of the cpu,
with no success. Finally, we installed a different driver, which
ostensibly only allowed support for ra60s, no mention of change to
how the drive was handled. Lo and behold, the problem went away.
(I should mention that the first driver was apparently acquired from within
DEC, the second, correct, one came over the net. Just goes to show who
you should trust). Anyway, I went back and checked, and sure enough, the
second driver does not issue the Get_Unit_Status command. Now, there are
still some un-answered questions, such as why that particular machine
started having problems, since it is not the most heavily loaded system,
and we tried swapping things all over the two unibuses to try and minimize
the possibility of unibus contention being the problem. Also, once the
problem started appearing, it got to the point where the system would fail
with only a few people on, mostly idle. Anyway, thanks again for clearing
up why the problem got fixed.

	On a side note, I would like to mention that the local DEC people
knew next to nothing about the drive and MSCP in general (in fact, I have
all of the "official" documentation that DEC gives them - it is hand written
explanations of some of the more common error codes), but to be fair, DEC
did fly in an expert to meet with us who was knowledgeable about the drive.
He also was not able to isolate the problem, but it does seem to be a subtle
one. Anyone know if Ultrix has a correct driver?

-- 
					eric
					...!seismo!umcp-cs!aplvax!eric

henry@utzoo.UUCP (Henry Spencer) (08/18/84)

> Now, does anyone out there know any good rumors about "fast fork" for 4.2+n?
> This uses copy-on-write shared memory; we once heard that this would require
> a microcode update and was thus delayed.  I didn't understand that rumor,
> since it seems doable with software.  Any comments?

What I heard was that the 750 has a microcode bug that prevents copy-on-write
from working properly, this being one of the reasons why fast fork has been
so long in coming.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

dmmartindale@watcgl.UUCP (Dave Martindale) (08/18/84)

Alex White, who worked on the UDA50 driver here, theorized that the
"Get Unit Status" botch was written into the driver at a time
when the UDA50 could queue only 15 outstanding requests, which happens
to be the same as the number of BDP's available on the 780 UBA.
If all of the BDP's are in use, and if you don't use them for anything
other than the UDA50, then the UDA can't handle another request anyway
right now, and sending out the Get Unit Status really just provides you
with a way to get an interrupt at the point that the UDA50 finishes
one of the transfers, coincidentally freeing up a BDP.

This strategy doesn't work if the UDA50 can handle more than 15 requests
(new UDA50's do 22) or if you have some other device using one of the
BDP's.  In this case, you get constant interrupts.
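
The theory above can be put into a toy model.  This is only a sketch
of the argument as stated, not of the real hardware or driver:  the
function name and the polls_until_release parameter (how many Get Unit
Status round trips fit in before some BDP is released) are invented
for illustration.

```c
#include <assert.h>

/*
 * Toy model: the driver has just failed to get a BDP and falls back
 * on Get Unit Status.  Returns the number of interrupts that do no
 * useful work before a BDP is released.
 *
 *   nbdp                 BDPs on the UBA (15 on a 780, per the post)
 *   uda_slots            requests the UDA50 can queue (15 old, 22 new)
 *   other_bdps           BDPs held by devices other than the UDA50
 *   polls_until_release  hypothetical: Get Unit Status round trips
 *                        that fit in before any BDP is released
 */
int wasted_interrupts(int nbdp, int uda_slots, int other_bdps,
                      int polls_until_release)
{
    int uda_in_flight;

    assert(other_bdps >= 0 && other_bdps <= nbdp);

    /* Every BDP is busy, so the UDA50 holds whatever the others do not. */
    uda_in_flight = nbdp - other_bdps;

    /*
     * Controller already full: the Get Unit Status cannot be taken up
     * until a transfer finishes, and that same completion frees a BDP.
     * The interrupt arrives exactly when it is useful.
     */
    if (uda_in_flight >= uda_slots)
        return 0;

    /*
     * Controller has room: each Get Unit Status completes at once and
     * interrupts, the driver still finds no BDP, and it sends another.
     */
    return polls_until_release;
}
```

With nbdp = 15, the model gives zero wasted interrupts only for
uda_slots = 15 and other_bdps = 0; raising uda_slots to 22, or giving
one BDP to another device, makes every poll a wasted interrupt.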

chris@umcp-cs.UUCP (08/21/84)

First:  Please post diffs to the 4.2 UDA driver.

Second:

The reason for the Get Unit Status command in the first place (the one
which floods the UBA and UDA with interrupts) is because *something*
has to be done at that point, and I suppose the author of the original
code felt that M_OP_GTUNT was the least drastic.  Here's the scenario:

	N requests for Unibus BDPs all granted.  (N depends on CPU
	type.)

	UDA50 requests BDP.  Request is denied because the UBA is out
	of BDPs.  The driver can't wait for one because this is
	happening at interrupt level.  What to do?

Solution 1: return from the interrupt code, without doing anything at
all.  This would work except for one snag:  what if there are no
transfers pending on that controller?  No more interrupts will occur,
and we'll never get another shot at grabbing a BDP and starting the
transfer.  (Another thing that would need to be done is to allocate the
MSCP packet *after* getting the BDP, not before.  Unless you know a way
to give back an MSCP packet . . . ?)

Solution 2: do a Get Unit Status.  That doesn't need a BDP and can use
the handy MSCP packet that's already been allocated.  Unfortunately, it
floods the UBA with interrupts until a BDP is finally released.

Solution 3 (my favourite but requires hacking the UBA code):  Return
from the interrupt code after exacting a promise from the UBA code to
call the interrupt routine again once a BDP is free.  Requires some
sort of queueing, alas.  Another possibility is to just set a flag
someplace and have a callout daemon (the udwatch watchdog routine that
isn't there for some reason, perhaps) call the interrupt routine.
Easier but not as aesthetic.

Solution 4: apply Solution 1 after moving all other devices that use
BDPs to another Unibus.  That would guarantee that the snag never
occurs.  However, while technically feasible, it may not be a viable
option.  (I can just see trying to justify another UBA to the state
``because we can't write the software right''. . . .)
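
A rough user-space sketch of the queueing that Solution 3 asks for
might look like the following.  None of these names (bdp_pool,
bdp_want, ctlr_start and so on) come from the 4.2 UBA code; they are
made up for illustration, and a real version would also have to deal
with interrupt priority levels.

```c
#include <assert.h>

#define MAXWAIT 8               /* hypothetical limit on queued waiters */

typedef void (*bdp_callback)(void *arg);

struct bdp_pool {
    int total;                      /* BDPs on this (imaginary) UBA */
    int free_bdps;                  /* BDPs not currently allocated */
    bdp_callback waiter[MAXWAIT];   /* FIFO of routines to call back */
    void *waiter_arg[MAXWAIT];
    int head, nwait;
};

void bdp_init(struct bdp_pool *p, int nbdp)
{
    p->total = p->free_bdps = nbdp;
    p->head = p->nwait = 0;
}

/* Non-blocking allocate: returns 1 on success, 0 if no BDP is free. */
int bdp_alloc(struct bdp_pool *p)
{
    if (p->free_bdps == 0)
        return 0;
    p->free_bdps--;
    return 1;
}

/* The "promise": call fn(arg) when a BDP is released.  0 if queue full. */
int bdp_want(struct bdp_pool *p, bdp_callback fn, void *arg)
{
    int tail;

    if (p->nwait == MAXWAIT)
        return 0;
    tail = (p->head + p->nwait) % MAXWAIT;
    p->waiter[tail] = fn;
    p->waiter_arg[tail] = arg;
    p->nwait++;
    return 1;
}

/* Release one BDP and call the oldest waiter, if there is one. */
void bdp_release(struct bdp_pool *p)
{
    assert(p->free_bdps < p->total);
    p->free_bdps++;
    if (p->nwait > 0) {
        bdp_callback fn = p->waiter[p->head];
        void *arg = p->waiter_arg[p->head];

        p->head = (p->head + 1) % MAXWAIT;
        p->nwait--;
        fn(arg);
    }
}

/* Stand-in for a controller with transfers waiting to start. */
struct ctlr {
    struct bdp_pool *pool;
    int pending;                /* transfers still needing a BDP */
    int started;                /* transfers that got one */
};

/* Start as many transfers as possible; never poll the controller. */
void ctlr_start(void *arg)
{
    struct ctlr *c = arg;

    while (c->pending > 0) {
        if (!bdp_alloc(c->pool)) {
            /* Out of BDPs: ask to be called back, then just return. */
            (void)bdp_want(c->pool, ctlr_start, c);
            return;
        }
        c->pending--;
        c->started++;
    }
}
```

With a pool of 2 BDPs and 3 pending transfers, ctlr_start() starts two
and queues itself; the next bdp_release() calls it back and the third
transfer starts, with no Get Unit Status traffic in between.  If the
wait queue is full the request is silently dropped here, which a real
driver could not afford.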
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci (301) 454-7690
UUCP:	{seismo,allegra,brl-bmd}!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@maryland