[comp.sys.3b1] Lots of weird hangs reported - could be a 3.51m bug?

pschmidt@athena.mit.edu (Peter H. Schmidt) (02/07/91)

Recently a spate of postings has appeared detailing several people's problems
with processes dying (e.g. smgr), slowing down (wmgr), and machines hanging.
Since my machine has displayed all these symptoms, I am eager to get to the
bottom of this.  To be brief, I think we've hit on a bug in 3.51m.

winter: 3b1, 2M RAM, ICUS 2 disk, disks in external cabinet, WD2010, 3.51m,
floppy tape, periodic 9600baud uucp on tty000, getty on ph1.

My own problem always has the same basic structure: some background program
gets weird, like smgr stopping the processing of cron jobs; then getty's on
tty000 and ph1 become unable to answer the phone; a few minutes later, the
system clock freezes, and at this point all I can do is mouse on the windows -
typing is never echoed, and the SHFT-SUSP hotkey takes ~2 minutes to change
windows.  At this point, I have to hit the reset button, whereupon winter comes
right up and works dandy.  A look at the uucp logs and the cron logs then lets
me reconstruct (in part) how the failure happened.
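
One way to make that after-the-fact reconstruction less of a guessing game is
to leave a heartbeat trail and look for gaps in it after the reset.  The
following is only a minimal sketch, not something from the original posting:
it assumes a cron job appends one line of epoch seconds per minute to a
hypothetical heartbeat file, and the names find_stalls and parse_heartbeats
are invented for illustration.  Since smgr stopping cron jobs is part of the
failure, the last heartbeat before a gap would roughly mark where things
started going wrong.

```python
def parse_heartbeats(lines):
    """Turn heartbeat lines (epoch seconds, one per line) into a list of ints.

    Lines that are empty or not purely digits are skipped, so a partially
    written last line after a hard reset does not abort the analysis.
    """
    stamps = []
    for line in lines:
        line = line.strip()
        if line.isdigit():
            stamps.append(int(line))
    return stamps


def find_stalls(timestamps, interval=60, slack=2.0):
    """Return (last_seen, next_seen) pairs where heartbeats went missing.

    interval is the expected spacing in seconds (assumed: one cron run per
    minute); a gap counts as a stall when it exceeds interval * slack.
    """
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > interval * slack:
            stalls.append((prev, cur))
    return stalls
```

Feeding the result of parse_heartbeats(open(path)) to find_stalls gives the
windows to compare against the uucp and cron logs; the slack factor is an
arbitrary choice to tolerate a late cron run.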

Unfortunately, this behavior is maddeningly inconsistent.  It is not related
to disk I/O, or power supply voltages.  The programs fail in random order, and
it can happen in as little as 12 hours, or after over a week.  Sometimes, just
for variety, I get a panic, but never the same one.  Before 3.51m, I would get
what now seems like the same behavior, but at intervals of months.  However, I
can't say for sure that the increased frequency started with 3.51m.  It has
only gotten really bad in the past 4 months.

I tried de-dust-bunnying, tweaking the PS voltages upwards, and shutting down
my uucp polling.  Nothing has helped. The diagnostics pass flawlessly (don't
they always?).  I haven't changed the system software since I installed 3.51m.
MeterMaid always shows ample clists and serial buffers, and a decent amount of
free pages.

I wouldn't have wasted this net.bandwidth if it didn't appear from others'
postings that I may not be alone.  Anybody else out there have problems like
this, or ideas on how to fix it?

Regards -- Peter, the Often Rebooting
--
Peter H. Schmidt	| ...mit-eddie!winter!pschmidt
3 Colonial Village, #10	| winter!pschmidt@mit-eddie.mit.edu
Arlington, MA  02174	| -- Speaking for myself.

wilkes@mips.COM (John Wilkes) (02/07/91)

I used to have the occasional odd hang and other strange occurrences.  Then
I got a UPS.  System stays up for months at a time, and then comes down
usually because of some pilot error or the power going away for longer than
the UPS stays up.  I've been running 3.51m since about a week after it was
available, and I've not noticed any problems with the machine up 24 hours a
day.  However, the machine is mostly a personal convenience, and it is
lightly used (i.e., not running netnews, no other regular users, low uucp
traffic, etc.)  I'll have to pay more attention to window switching time,
though.

My personal theory is that the 3b1's power supply is sensitive to the power
coming from the line.  The $500 I spent on a UPS is one of the best
investments I've ever made.

-wilkes  <wilkes@mips.com>

thad@btr.btr.com (02/07/91)

wilkes@mips.COM (John Wilkes) in <45628@mips.mips.COM> writes:

	I used to have the occasional odd hang and other strange occurrences.
	Then I got a UPS.  System stays up for months at a time, and then
	comes down usually because of some pilot error or the power going away
	for longer than the UPS stays up.  I've been running 3.51m since about
	a week after it was available, and I've not noticed any problems with
	the machine up 24 hours a day.
	[...]
	My personal theory is that the 3b1's power supply is sensitive to the
	power coming from the line.  The $500 I spent on a UPS is one of the
	best investments I've ever made.

I second John's comments without hesitation.  From personal experience I can
state that unconditioned AC line power is ALL computers' nemesis.

I've posted (volumes) on this subject before, and have rented an AC power line
monitor and watched 2000V spikes, hash, and other garbage on the power lines,
caused by refrigerators, fluorescent lamps, drill motors, etc., create "weird"
and "inexplicable" system problems.

At the MINIMUM, one needs an "adequate" surge and transient suppressor for a
computer.  And, I can say this without hesitation or fear of retribution or
lawsuit (because the technical facts attest and substantiate my comment): the
stuff from Radio Shack is JUNK and you don't want that stuff anywhere near your
computer.

After putting out bids and personally examining devices and reading the test
reports in various magazines, the surge and transient protectors I have on my
systems are those from GTE (cringe, yes, I know the 3B1 bears an AT&T logo :-).
But, since 1985, I have not had a SINGLE AC-power-related problem with ANY of
my computers, and I have a LOT of systems running here.  Prior to using the GTE
devices, I would get numerous floppy errors each week (and you don't want to
know the HD travails).

At this point in time, ALL my computers, modems and LANs are serviced by a
SAFE "UPS", and everything else (printers, plotters, phone lines, etc.) is
protected by GTE AC surge/transient suppressors.  When you consider that my
systems remain up for 6 months at a time with NO (repeat, NO) problems, that
should be sufficient testimony; and the only reason I mention "6 months" is
because that's the interval I've chosen for powering-down and cleaning the
fans and guts.

Thad Floryan [ thad@btr.com (OR) {decwrl, mips, fernwood}!btr!thad ]

bruce@sonyd1.Broadcast.Sony.COM (Bruce Lilly) (02/08/91)

In article <1991Feb6.220445.8640@athena.mit.edu> pschmidt@athena.mit.edu (Peter H. Schmidt) writes:
>  To be brief, I think we've hit on a bug in 3.51m.
>
>My own problem always has the same basic structure: some background program
>gets weird, like smgr stopping the processing of cron jobs; then getty's on
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I've seen this once.

>tty000 and ph1 become unable to answer the phone; a few minutes later, the
>system clock freezes, and at this point all I can do is mouse on the windows -
>typing is never echoed, and the SHFT-SUSP hotkey takes ~2 minutes to change

In cases where no typing echoes and things really slow down, it's usually
because all of the clists are full (second bar in first group goes to
zero). This has always been due to a jabbering quasi-modem -- yanking the
plug on it restores it and the computer to sanity.

>Unfortunately, this behavior is maddeningly inconsistent.  It is not related
>to disk I/O, or power supply voltages.  The programs fail in random order, and
>it can happen in as little as 12 hours, or after over a week.  Sometimes, just
>for variety, I get a panic, but never the same one.  Before 3.51m, I would get

I've had quite a few panics, also.

>what now seems like the same behavior, but at intervals of months.  However, I
>can't say for sure that the increased frequency started with 3.51m.  It has
>only gotten really bad in the past 4 months.

I'd say unhesitatingly that things got noticeably worse after
``upgrading'' from 3.51 to 3.51m.  But I don't want to go back
because of the metermaid and a few other added features.
When I ran 3.51, systems would often stay up and running for 4-5
months. Now I'm lucky if things stay up for 2 months, and I
occasionally have to restart a dead or comatose process.

>MeterMaid always shows ample clists and serial buffers, and a decent amount of
                                         ^^^^^^^^^^^^^^
I've never seen this dip below about 95% -- I wish it were
configurable so I could use the space elsewhere.

>this, or ideas on how to fix it?

I guess we just have to wait for fixdisk 3 :-).

-- 
    Bruce Lilly, Product Manager,      | bruce@Broadcast.Sony.COM
    Digital Television Tape Recording, | uunet!sonyusa!sonyd1!bruce
    Sony, 3 Paragon Drive, Montvale,   | lilb@sony.compuserve.com (slow)
    NJ 07645-1735  |  Telephone: 1(201)358-4161  |  FAX: 1(201)358-4089