[comp.bugs.4bsd] Solution to: Can't access disks on second UDA50

steve@dartvax.UUCP (Steve Campbell) (08/10/87)

In article <6683@dartvax.UUCP> I wrote:

>Although conventional wisdom says not to put more than 1 UDA50 per
>unibus, we are trying to do just that.  We have added a second UDA50 to
>the bus and a third-party device called a USI/HRS from a company named
>Shitashi which claims to enhance the unibus bandwidth enough to permit
>the second uda.  The other devices on the unibus are a DEUNA and 2 DZ11s.
>
>For testing purposes, we moved 2 RA81s from uda0 to uda1, so we have...
>
>controller	uda0	at uba0 csr 0172150		vector udintr
>disk		ra0	at uda0 drive 0
>disk		ra1	at uda0 drive 1
>controller	uda1	at uba0 csr 0172550		vector udintr
>disk		ra2	at uda1 drive 2
>disk		ra3	at uda1 drive 3
>
>As far as we can tell, the hardware is working just fine.  
>But ...  if we do a large number of accesses to files on any disk
>USING PATHNAMES, then do a sync, the 2 disks on the second uda cannot
>be accessed, and the command - and terminal - trying to do so hangs
>completely.

Several people suggested that the configuration specification was the
problem; others said no, the config is OK.  Ed Gould posted a nice mini-
dissertation on how configuration names are mapped.  Conclusion: the 
configuration is OK.

Other people suggested adjusting the time delay jumper on the UDA50.
[BTW, beware of a typo in the DEC UDA50 Users Manual table that tells
how to set that jumper. The pins are mislabeled.]  Someone else
suggested changing UDABURST in the driver.  Conclusion: these
adjustments no doubt affect performance, but they were not the cause of
my problem.

Scott Bradner (harvisr!sob) pointed me in the right direction:

> the 4.3 uda driver has a bug that causes the drives on a 2nd controller
> to appear to go off line under load; any processes that are accessing
> those drives will hang forever.

Jean Huens (kulcs!jean) got closer:

> I got similar problems on a MicroVAX.  There (on a Q-bus) we have an
> RQDX (more or less UDA-compatible: same driver) from DEC and a second
> RQDX-compatible controller (Sigma) with a Fujitsu Eagle.  Occasionally
> processes got hung waiting for the Fujitsu.  The problem was that the
> controller was idle (without outstanding commands), but there were
> still requests from Unix waiting (it looks like a lost interrupt or a
> race condition).  I looked in the uda driver from Ultrix 1.2 and saw
> that they start a timer there which calls the udastart routine
> regularly (once a minute).  This cured the problem with the disk.
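
To make the failure mode Jean describes concrete, here is a minimal
user-space sketch in C.  It is not the Ultrix or 4.3BSD driver code, and
every name in it (toy_ctlr, toy_start, toy_watchdog, and so on) is
hypothetical.  It only models the idea: if a completion interrupt is
lost, the controller sits idle while requests are still queued, and a
periodic call to the start routine gets things moving again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QLEN 8

/* Hypothetical toy controller; NOT the real UDA50/MSCP interface. */
struct toy_ctlr {
    int    queue[QLEN];  /* pending request ids, kept as a FIFO ring */
    size_t head, count;  /* ring position and number of queued requests */
    bool   busy;         /* true while a command is outstanding */
    int    completed;    /* number of requests finished so far */
};

/* Queue a request.  This does not start it. */
static void toy_enqueue(struct toy_ctlr *c, int id)
{
    assert(c->count < QLEN);
    c->queue[(c->head + c->count) % QLEN] = id;
    c->count++;
}

/*
 * Start routine: if the controller is idle and work is waiting, issue
 * the next command.  Returns true if a command was issued.
 */
static bool toy_start(struct toy_ctlr *c)
{
    if (c->busy || c->count == 0)
        return false;
    c->head = (c->head + 1) % QLEN;
    c->count--;
    c->busy = true;
    return true;
}

/*
 * The "hardware" finishes the outstanding command.  When deliver_intr
 * is false the interrupt is lost: the controller is idle, but nothing
 * tells the driver to start the next queued request.
 */
static void toy_complete(struct toy_ctlr *c, bool deliver_intr)
{
    assert(c->busy);
    c->busy = false;
    c->completed++;
    if (deliver_intr)
        toy_start(c);    /* normal path: interrupt handler restarts I/O */
}

/*
 * Watchdog, meant to be called from a periodic timer.  It simply calls
 * the start routine again.  Returns true if it had to un-jam the queue.
 */
static bool toy_watchdog(struct toy_ctlr *c)
{
    return toy_start(c);
}
```

In this model, enqueueing two requests, starting the first, and then
completing it with deliver_intr set to false leaves one request queued
on an idle controller.  Nothing moves until toy_watchdog() runs.  The
watchdog hides the lost interrupt rather than fixing it, and a stuck
request waits for up to one full timer period.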

Jean sent me that modified driver.  I installed it and ran my standard
test that would hang the system.  It hung as always... but as soon as
the timer that Jean mentions went off, the hung command completed normally.
It was spooky, as though there was a little gremlin in there that got poked
every minute or so and un-jammed things.  Now this jerky operation of the
system was not good enough for production work, but it seems to clinch what
was causing the problem.  I would like to hear an explanation from someone
who knows the hardware well.  Which leads me to...

The final solution to the problem.  In one posting Chris Torek wrote:

> What makes Steve's problem particularly perplexing is that everything
> works at least a little bit.  The machine finds the controllers
> and drives, and can talk to them a bit, e.g., with raw I/O.  Raw
> transfers do not really work the I/O system very hard, though, so
> I suspect some sort of hardware glitch with `simultaneous' transfers.
> 
> (My first suggestion, of course, was to try my driver....)

Well, I hate a smart aleck, especially one who turns out to be right.  I
tried Chris's driver, and it solves the problem.  The new configuration
works as well as the old.  Very nice work, Chris.

Chris's driver prints some identification information at boot time,
including the following from my machine:
	Aug  9 12:34:56 libdev vmunix: uda0: version 5 model 6
	...
	Aug  9 12:34:56 libdev vmunix: uda1: version 4 model 6
Is that different version number significant to this problem?

In conclusion, thanks to all for the help, and especially to Chris Torek
for the new driver that doesn't have the bug.  And how about a fix from
Berkeley for their standard driver?

					Steve Campbell
					Dartmouth College

chris@mimsy.UUCP (Chris Torek) (08/11/87)

In article <6842@dartvax.UUCP> steve@dartvax.UUCP (Steve Campbell) writes:
>Scott Bradner (harvisr!sob) pointed me in the right direction:
>>the 4.3 uda driver has a bug that causes the drives on a 2nd controller
>>to appear to go off line under load; any processes that are accessing
>>those drives will hang forever.

I know nothing of this bug.  The 4.3BSD driver does have a `feature'
which irritates a microcode bug in some UDA50s, causing the controller
itself to hang.  This is rare, and current UDA50s do not exhibit
the bug at all unless you have a 785 or 8600.  Controller hangs
are distinguished by the light patterns on one of the two modules
in the Unibus box: one of the LEDs stops blinking.

There is another bug in 8600s that loses UDA50 interrupts under
heavy interrupt load (we get it while using the 4.3BSD rdump on
Sun 3s).  I do not understand the details, but my driver recovers
eventually (at least if you have all your UDA50s on the same
Unibus!---there is a bug in the reset code in mscp.c).

>Jean Huens (kulcs!jean) got closer:
>>...  I looked in the uda driver from Ultrix 1.2 and saw that they
>>start a timer there which calls the udastart routine regularly.

Ugh!

>In one posting Chris Torek wrote:
>>(My first suggestion, of course, was to try my driver....)
>Well, I hate a smart aleck, especially one who turns out to be right.

Shall I make a point of being wrong on occasion? :-)

>Chris's driver prints some identification information at boot time,
>including the following from my machine:
>	Aug  9 12:34:56 libdev vmunix: uda0: version 5 model 6
>	...
>	Aug  9 12:34:56 libdev vmunix: uda1: version 4 model 6
>Is that different version number significant to this problem?

I am not sure what is different between versions 4 and 5; version
3 still exhibits the Get Unit Status hang bug on 780s.  We still
have some version 4 controllers here, and they work fine.  Of course
I *am* using my driver....
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
Domain:	chris@mimsy.umd.edu	Path:	seismo!mimsy!chris