[net.unix-wizards] One Emulex SC41/MS

chris@umcp-cs.UUCP (Chris Torek) (04/10/86)
We bought an Emulex SC41/MS controller and two CDC 9771 drives.
The SC41/MS is an `MSCP compatible' device that emulates a DEC
UDA50.  This article is an anecdotal description of our experiences
thus far with the controller and drives.

When we first obtained the hardware several months ago, we ran into
a few snags.  The University bureaucracy had managed to mangle the
order into listing the machine for which the controller was purchased
as a Vax 11/780; in fact, it was a 750, and we needed the Emulex
cassettes to format the drives, but of course because we said `780'
they sent us console floppies.  As it turns out, you can fix this
with `arff':  The following procedure copies a floppy to tape:

				# log in as root on 780,
				# and insert floppy #1
	cd /tmp; mkdir floppy1
	(cd floppy1; arff x)	# extract
	rcp -r floppy1 750:/tmp	# copy to 750
				# repeat for each floppy

				# log in as root on 750
	cd /tmp/floppy1		# go to floppy directory
	arff crmf /dev/tu0 *	# put everything on the tape

Of course, you still need a boot block on a bootable tape; but I
kludged around that by putting the bootable image from the first
floppy on our root file system, and making a companion to /boot
that loaded it.  You could also just copy the boot block from
any other DECtape that has it:

	dd if=/dev/tu0 of=bootblock	# with good tape
	dd if=bootblock of=/dev/tu0	# with new tape

In any case, the formatter worked fine, and after about eight hours,
both drives were formatted and verified.  (I should mention here
that Emulex did indeed send us the proper tapes; I was simply
impatient.)  Now we had nearly 1.35 gigabytes more space on our
machine.  Wonderful!  Now to put it to use . . . so I created file
systems and mounted them, and then the fun began.

After about five minutes, the machine hung.  It was clearly a bug
in---what else?---the UDA50 driver, as interrupts were still working
and CPU bound tasks kept going.  But the moment anything tried to
touch a drive, CDC or DEC RA81, it was blocked.  This was quite
repeatable: with the CDC drives mounted and in use, the machine
would hang within thirty minutes.  `Well,' thought I, `time to fix
the driver.'

Now at the time, we were running a modified version of the RIACS
driver.  For those of you who have not heard of it, this is the
one with dynamic bad block revectoring, so that when your RA81
begins to bobble bits, you need not reformat the entire drive,
with the attendant and painful dump-and-restore sequence.  The
key words describing this driver are `useful', `large', and
`thoroughly unreadable'.

After a few days I gave up the task of fixing the existing driver.
It was long overdue for a rewrite anyway; and I decided that I
should, instead of just fixing it, try my hand at writing a generic
MSCP driver, so that if and when we got a TMSCP tape, it would then
be a simple task to talk to it.  So of course the next step was
the first required when writing any driver: obtain the hardware
documentation.  `No problem!' thought I.  `I shall call DECDirect
and give them the order number straight from the Emulex manual.'
That I did, and this I discovered:  DEC does not sell the MSCP
documentation.  Yes indeed, it does exist; no you cannot get it.

Well, that stumped me for a while.  How can you write a driver
without knowing what it needs to do?  Ah, but wait!  We already
have a driver---nay, in fact, *three* drivers---that probably do
mostly the right things.  To make a long story short, I cannibalised
parts of the RIACS driver, the original 4.2 driver, and the 4.3
beta driver, to put together a completely redone version of my own.
Along the way I found out what all the CPU-dependent code was for,
and I changed the Unibus support code to do BDP allocation `right'.
It took several weeks, but at last I had a driver that booted and
ran.  (It took several more days before it crashed properly---a
bug in the dump code---and it was later still before it handled
Unibus resets, but it ran!)  I brought the CDC drives on line, and
waited for the driver to hang.  5 minutes . . . 15 . . . an hour,
more . . . *hooray!  It runs!*

Well, at last all our troubles were over.  Right?  Wrong.

A few nights later I went to dump the new file systems from the
CDC drives to tape.  We use a special kernel hack to make dump run
fast, so there I was loading tapes onto the TU80 and watching them
stream at 100 ips.  Well, make that about 75 ips average.  Performance
was not terrific; but that must be expected with Unibus disk drives,
for the fastest transfer rate achievable on a `real' Unibus is
550K/sec, and of course we had seek delays to deal with as well.
(Incidentally, for those to whom seek time is important, the CDC
drives list an average seek time of 18 ms., and no head switch
delay; compare this with, I think, 31 ms. and a 6 ms. head switch
delay on the RA81.)  Running iostat showed that the top performance
of the CDC drives was actually lower than that of the RA81s:  doing
large raw disk reads, peak performance on the CDC drives was about
350K/sec, while on RA81s it reaches the 550K/sec maximum.  Presumably
Emulex has not properly laid out the sectors rotationally; and
there is no way to change the sectoring:  It is in firmware on the
controller.  Perhaps Emulex will read this and put in a format
parameter in the next version.  ---But so what if the performance
was worse; we needed the disk space.  At least it worked.

Or so I thought.  `DUMP:  NEEDS ATTENTION: ...'  Time to change
tapes again.  Ok, tape number 5, go.  Watch the reels:  ZOOOOOM
forward, blip back, ZOOOOOM, blip, ZOOOOM, blip, blap.  Blap?  Hey,
what gives?  The tape drive has stopped.  Uh, Oh.  Wait, no console
response; must be hung at interrupt level.  Time to get another
crash dump.  Type control P.  I said control P.  *Control P*.
Oboy.  Look at the console lights.  POWER on, ok.  RUN on, ok.
ERROR off, ... off?

I quote from the DEC hardware handbook:

	-------------------------------------------------------------
	*Error* indicator	Lighted red brightly to indicate that
				the CPU is stopped because of an
				unrecoverable, control-store parity
				error.  Because console commands are
				ignored, the *reset* switch must be used
				to clear the error.

				Lighted red dimly to indicate that the
				CPU is functioning normally.
	-------------------------------------------------------------

Do *you* see any mention of `off'?

Well, to make another long story short, the machine would hang
quite thoroughly as long as the Emulex controller and the TU80
controller were on the same Unibus.  We moved the TU80 to another
Unibus adapter, so that now the SC41/MS was all by itself on UBA
zero, and the hangs stopped.  (No software changes, of course.)
Also interesting was the fact that with the CPU cabinet open, the
performance of the Emulex card changed.  It ran faster.  With the
cabinet closed it would sometimes slow down so much that the TU80
dropped to 25 ips streaming.  (This makes an enormous difference
in dump times for one CDC drive, from about two hours for a
330-megabytes-used file system up to about six hours.)  With the
TU80 on the other Unibus, that problem went away too.
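
As a rough check on those dump times, the following sketch turns
them into average transfer rates.  The 330 megabytes and the two
and six hour figures are the ones quoted above; treating a megabyte
as 1024K is an assumption, and `rate' is just a helper name made up
for this illustration.

```shell
#!/bin/sh
# Average dump rate, in whole K/sec, for a given size and duration.
# Assumption: 1 megabyte = 1024K.
rate() {	# usage: rate megabytes hours
	awk -v mb="$1" -v hr="$2" \
	    'BEGIN { printf "%d\n", mb * 1024 / (hr * 3600) }'
}

fast=$(rate 330 2)	# tape streaming well
slow=$(rate 330 6)	# tape dropped to 25 ips
echo "two-hour dump: about $fast K/sec"
echo "six-hour dump: about $slow K/sec"
```

The two rates differ by a factor of three, the same ratio as the
75 ips and 25 ips tape speeds, and both are far below the 350K/sec
raw-read peak measured on the CDC drives.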

Since then (it has been about a week) we have had exactly one crash,
this time due to a response packet from the Emulex controller
containing the wrong command reference number.  It should have said
`8009fec8'; but it said `80090000', so all is still not well.  Yet
it happened only once, and it could be a kernel bug:  we have installed
the kernel RFS from Todd Brunhoff, and we know of at least one bug,
so there may well be others.
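
Comparing the two command reference numbers bit by bit shows which
part came back wrong.  This is only a sketch of that comparison,
using the two values quoted above; it says nothing about whether
the controller or the kernel is at fault.

```shell
#!/bin/sh
expected=0x8009fec8	# what the response should have said
received=0x80090000	# what the response actually said

# Exclusive-or leaves a 1 in every bit position where the two differ.
diff=$(printf '%08x' $(( $expected ^ $received )))
echo "differing bits: $diff"
```

The upper sixteen bits match; only the lower sixteen differ, and in
the received value they are all zero.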

Summary:  The controller seems to work, as long as it is on a Unibus
by itself, or at least as long as it does not have to compete
greatly with another controller for Unibus resources.  But you may
want to avoid this particular controller, at least until it has
been exercised a bit longer.  The drives, on the other hand, are
very nice.  It is wonderful to run `df' and see /usr only 64% full,
with another 188 megabytes there alone, and more than 300 megabytes
free on the other drive.  It is too early to guess at reliability,
but there were no bad sectors at all on one of the two drives!
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 1415)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@mimsy.umd.edu