[net.bugs.4bsd] UDA50 problems

stan@tikal.UUCP (Stan Tazuma) (12/16/85)

Configuration:
	VAX 11/750
	4.1bsd Unix
	1 UDA50 disk controller, 2 RA81's
	TU78 tape controller with 2 slaves

Problem messages:
	uda0: soft error, disk transfer error, unit 1
	uda0: soft error, SDI error, unit 1, event 0353

	uda0: soft error, disk transfer error, unit 0
	uda0: soft error, SDI error, unit 0, event 0353

    The SDI error message is always paired with the disk
    transfer error message.
    Notice no sector numbers.  The messages occur sometimes 3
    hours apart, sometimes 24 hours apart, but usually not in
    bunches.  Typically, only 2-6 occur per day on a fairly
    active system.

Other:
	Out of 3 identical installations, we get these error
	messages on only one.  The system doesn't crash when
	these messages occur, but peculiar things happen (like
	the autoboot sequence pausing (until a key is hit)
	just before /etc/rc is run (no, we don't have rc prompt
	for anything)).

Anybody have experiences with these errors?
We have had DEC field service on this for weeks (months?),
and they don't seem to know what to do.  We have no manuals on
the UDA50 (the computers are installed at customer sites),
so if anybody out there does, maybe telling us what
event 0353 is would be of some help(?)

Thanks in advance.

spaf@gatech.CSNET (Gene Spafford) (12/18/85)

In article <296@tikal.UUCP> stan@tikal.UUCP (Stan Tazuma) writes:
>Other:
>	Out of 3 identical installations, we get these error
>	messages on only one.  The system doesn't crash when
>	these messages occur, but peculiar things happen (like
>	the autoboot sequence pausing (until a key is hit)
>	just before /etc/rc is run (no, we don't have rc prompt
>	for anything)).
>
>Anybody have experiences with these errors?

I'm afraid I can't help you with your disk errors, but I can tell you
why your system hangs during autoboot.

The printing console you get with the 750 (probably a LA100-type) can't
take characters at the rate the controller is strapped for.  When any
significant amount of output is transferred to the console before the
system comes up, the terminal sends a control S to the system to stop
the output to prevent a buffer overflow.  So far, so good.  The console
catches up and sends a control Q to restart output (the output never
really stops since nothing is paying attention to the console at the
moment).  Unfortunately, since Unix isn't back up, it hasn't issued the
command to fetch the control S out of the console input register.  The
control Q gets tossed away and the overrun bit is set.

Now the system continues on its merry way and suddenly comes up far enough
to poll the console.  Lo, there's a control S (it never checks the
overrun bit)!  It processes the control S and "stops" output.  Eventually,
the buffer fills up and Unix hangs in the reboot code as it tries to
put more characters into the buffer.  Typing any character causes
the driver to start printing again.
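
The sequence described above can be replayed in a small simulation. This
is only a sketch under stated assumptions: ConsoleReceiver and
naive_driver_stopped are hypothetical names, and the one-character
register that keeps the old character and drops the new one on overrun
is modeled on the description in this post, not on the 750 hardware
documentation.

```python
XOFF = 0x13  # control-S
XON = 0x11   # control-Q


class ConsoleReceiver:
    """Hypothetical one-character receive register with an overrun flag."""

    def __init__(self):
        self.data = None       # character waiting to be read, if any
        self.overrun = False   # set when a character arrives while data is unread

    def receive(self, ch):
        # As described in the post: the unread character stays in place,
        # the newcomer is discarded, and the overrun bit is set.
        if self.data is not None:
            self.overrun = True
            return
        self.data = ch

    def read(self):
        # Return (character, overrun flag) and clear both.
        ch, overrun = self.data, self.overrun
        self.data, self.overrun = None, False
        return ch, overrun


def naive_driver_stopped(rx):
    """Return True if a driver that ignores the overrun bit stops output."""
    ch, _overrun = rx.read()
    return ch == XOFF


rx = ConsoleReceiver()
rx.receive(XOFF)   # printer falls behind while the system is booting
rx.receive(XON)    # printer catches up, but nobody has read the XOFF yet
print(naive_driver_stopped(rx))   # the stale control-S is what the driver sees
```

In this model the control Q is lost and the first thing the driver reads
is the stale control S, so output stays stopped until something else is
typed.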

Fixes in order of preference:
Setting the console's draft/letter mode button to the opposite of its
current setting sometimes works (set it to draft?  I'm
at home at the moment and can't remember which setting is needed).

Change the kernel either to have a bigger console buffer, or to ignore
a control S that is fetched with the overrun bit set, or to ignore
a control S when it is the first character ever fetched.

Configuring the terminal with the "setup" key to not use control
S/Q  (not good, since this will scramble later output).
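
The kernel-side check in the second fix might look roughly like the
following. It is a minimal sketch, not the actual 4.1bsd console driver:
should_honor_xoff and its parameters are hypothetical names, and it
combines both suggested tests (overrun bit set, first character ever
fetched) even though the post offers them as alternatives.

```python
XOFF = 0x13  # control-S


def should_honor_xoff(ch, overrun, first_char):
    """Decide whether a received character should stop console output.

    ch         -- character fetched from the console input register
    overrun    -- True if the overrun bit was set when ch was fetched
    first_char -- True if this is the first character fetched since boot
    """
    if ch != XOFF:
        return False
    # A control-S seen with overrun set means a later character (probably
    # the matching control-Q) was lost. A control-S as the very first
    # character was sent before the driver was listening. Both are stale.
    if overrun or first_char:
        return False
    return True


print(should_honor_xoff(XOFF, overrun=True, first_char=True))    # stale at boot
print(should_honor_xoff(XOFF, overrun=False, first_char=False))  # normal flow control
```

Ordinary flow control after boot is untouched, since a control S that
arrives with no overrun and is not the first character is still honored.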

Hope that helps!
-- 
Gene "the end is in sight" Spafford
The Clouds Project, School of ICS, Georgia Tech, Atlanta GA 30332-0280
CSNet:	Spaf @ GATech		ARPA:	Spaf%GATech.CSNet @ Relay.CS.NET
uucp:	...!{akgua,decvax,hplabs,ihnp4,linus,seismo,ulysses}!gatech!spaf

roy@phri.UUCP (Roy Smith) (12/19/85)

> The printing console you get with the 750 (probably a LA100-type) can't
> take characters at the rate the controller is strapped for.
> 
> Fixes in order of preference:
> Setting the console to draft/letter mode
> Change the kernel to [...] have a bigger console buffer
> Configuring the terminal with the "setup" key to not use control S/Q

	Why not just change the baud rate from the default 2400 down to
1200?  That's what we did on our 750/LA-120 combo (down with LA-100's!).
The printer can keep up without ever having to send a C-S, and 1200 is
plenty fast enough anyway (in fact, if you make the console 300, that might
be better because it dissuades people from using it as a regular terminal).
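
The arithmetic behind this is simple enough to sketch. The figures here
are assumptions for illustration, not taken from the thread: 10 bits on
the wire per character (1 start, 8 data, 1 stop) and a printer rated at
180 characters per second. Substitute the rated speed from your own
printer's manual.

```python
def chars_per_second(baud, bits_per_char=10):
    """Serial throughput, assuming 1 start + 8 data + 1 stop bit per character."""
    return baud / bits_per_char


def printer_keeps_up(baud, printer_cps):
    """True if the line cannot deliver characters faster than the printer prints."""
    return chars_per_second(baud) <= printer_cps


PRINTER_CPS = 180  # hypothetical rated print speed; check your printer's manual

for baud in (2400, 1200, 300):
    print(baud, chars_per_second(baud), printer_keeps_up(baud, PRINTER_CPS))
```

Under those assumptions 2400 baud delivers 240 characters per second and
outruns the printer, while 1200 baud (120 per second) and 300 baud (30
per second) do not, so the printer never needs to send a control S.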

	The only problem with this is that to change the console baud rate
you have to move jumpers on the CPU backplane.  Why DEC did this, I'll
never know.  BTW, on the 2 VAX installations I've supervised, the tech
doing the install had the baud rate on the console set wrong and refused to
believe that was the problem (they prefer to take the whole CPU apart).
-- 
Roy Smith <allegra!phri!roy>
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016

gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) (12/21/85)

I would prefer the system to flush all terminal port FIFOs
when it comes fully up.  Nothing in them can be any good
anyway.
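
As a rough illustration of that idea (a sketch only: TerminalPort and
flush_ports_at_startup are hypothetical names, and a real driver would
drain hardware registers rather than a Python deque):

```python
from collections import deque


class TerminalPort:
    """Hypothetical terminal port with a software input FIFO."""

    def __init__(self, name):
        self.name = name
        self.input_fifo = deque()


def flush_ports_at_startup(ports):
    """Discard pending input on every port; return how many characters were dropped."""
    dropped = 0
    for port in ports:
        dropped += len(port.input_fifo)
        port.input_fifo.clear()
    return dropped


console = TerminalPort("console")
console.input_fifo.append(0x13)   # a stale control-S left over from boot
print(flush_ports_at_startup([console]))
```

Anything typed or sent before the system was fully up, including a
leftover control S, is thrown away before the driver starts interpreting
input.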

neil@man.psy.UUCP (Neil Todd @ UK.AC.MAN.CS.UX) (12/21/85)

In article <296@tikal.UUCP> you write:
>Configuration:
>	VAX 11/750
>	4.1bsd Unix
>	1 UDA50 disk controller, 2 RA81's
>	TU78 tape controller with 2 slaves
>
>Problem messages:
>	uda0: soft error, disk transfer error, unit 1
>	uda0: soft error, SDI error, unit 1, event 0353
>
>	uda0: soft error, disk transfer error, unit 0
>	uda0: soft error, SDI error, unit 0, event 0353
>
>    The SDI error message is always paired with the disk
>    transfer error message.
>    Notice no sector numbers.  The messages occur sometimes 3
>    hours apart, sometimes 24 hours apart, but usually not in
>    bunches.  Typically, only 2-6 occur per day on a fairly
>    active system.
>
>Other:
>	Out of 3 identical installations, we get these error
>	messages on only one.  The system doesn't crash when
>	these messages occur, but peculiar things happen (like
>	the autoboot sequence pausing (until a key is hit)
>	just before /etc/rc is run (no, we don't have rc prompt
>	for anything)).
>
>Anybody have experiences with these errors?
>We have had DEC field service on this for weeks (months?),
>and they don't seem to know what to do.  We have no manuals on
>the UDA50 (the computers are installed at customer sites),
>so if anybody out there does, maybe telling us what
>event 0353 is would be of some help(?)
>
>Thanks in advance.

Event 0353 has to do with "drive detected errors",
to quote the MSCP Basic Disk Functions Manual. I think that
it turns up because the disk has a bad block. In spite of
what many people think, neither the standard driver nor
the hardware will revector the bad block.
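
For anyone without the manual, the event code can at least be split into
its fields. This is a sketch that assumes the MSCP layout in which the
low five bits are the major status/event code and the remaining high
bits are a subcode (the convention used in later BSD MSCP headers).
STATUS_MASK, SUBCODE_SHIFT and decode_event are names chosen here for
illustration.

```python
STATUS_MASK = 0o37   # assumed: low 5 bits hold the major status/event code
SUBCODE_SHIFT = 5    # assumed: the remaining high bits hold the subcode


def decode_event(event):
    """Split an MSCP event code into (major code, subcode)."""
    return event & STATUS_MASK, event >> SUBCODE_SHIFT


major, subcode = decode_event(0o353)
print(oct(major), subcode)
```

Under that assumption, 0353 (octal) splits into major code 013 (octal)
and subcode 7. In that layout major code 013 is the drive error class,
which would be consistent with the "drive detected errors" wording
quoted above, but check the subcode meaning against the manual.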

The RIACS uda driver will fix this problem.

Neil Todd

JANET :- neil@uk.ac.man.cs.ux
UUCP  :- ...!mcvax!ukc!man.cs.ux!neil
ARPA  :- neil%uk.ac.man.cs.ux@ucl.cs

P.S. My DEC man didn't have a clue either.