[unix-pc.general] Unix-PC crashing during uucico

elliot@alfred.UUCP (Elliot Dierksen) (01/11/90)

I have been having a very annoying problem lately. While my system (3.51a,
4MB RAM, HDB uucp, 2224CE0 modem) is talking to my main news feed (386 PC w/
internal Hayes-compatible modem), it crashes. I come in and DTR is still being
sent to the modem. I hit a key to try and unblank the screen and nothing
happens. However, it always comes right back up when I hit reset. It only
seems to happen when they call me, not when I call them. I don't know if it
is just a coincidence that it is always the same system. It is my primary
news feed, and we move around 1 MB of data a day. The weird thing is that it
isn't consistent. Sometimes I'll go a week with no problems, and this week
it has crashed 3 or 4 times!  Help!!

EBD
-- 
Elliot Dierksen        "I don't care if my lettuce has DDT on it,
                        as long as it's crisp!!" -- Jorma Kaukonen
Work) {att,codas}!candi!fang!ebd                      (407) 660-3377
Home) {peora,uunet,ucf-cs}!tarpit!alfred!elliot       (407) 290-9744

wtm@neoucom.UUCP (Bill Mayhew) (01/14/90)

We get random crashes on our 3b1, neoucom, which serves as our
gateway to the uucp domain.  Running the stock version 2, we had
more crashes than we get with the HDB system.  The bugs seem to
happen if you have more than one uucp running at a time, such as
ph1 and tty000 together.  The port driver seems to get confused
while handling the 7201 interrupts and leaves a bollixed address in
one of the 68010 registers.  Supposedly this problem was fixed in
3.51d (which it would appear is only available inside of AT&T at
this moment).  With HDB, you can set Maxuuxqts to 1, which, it
seems, should prevent multiple uucicos from running.

Nonetheless, we get an occasional crash every week or two.  That
isn't too awful, considering the 3b1 handles 2 or more megabytes of
news and mail on busy days.  If that news gets bottled up, it can
be a problem, so I built a hardware solution.  I looked through my
junk box and came up with a 6502, 6522, 2716, 6116, and a 16
character LCD display.  I put together a very small computer that
watches the tty000 port on the 3b1.  If my box doesn't see any
activity on the tty port for more than 70 minutes, it turns off a
solid state relay for 30 seconds to cut power to the 3b1, forcing a
reboot.  We set up Poll to make sure that uucico runs once an hour.
The 6502 program is only about 1K of assembly level code, and most
of that is for running the LCD display.  I also wrote in support
for input from an overtemp thermistor, but we've been too lazy to
disassemble the 3b1 to mount a thermistor on the power supply.  I
experimented with measuring the temperature of the air coming out
of the fan grille, but variations in the room temperature were too
great to differentiate between fan failure and ambient variation.
I also ruled out using an air pressure sensor as too unreliable.
On our 3b1, the air leaving the fan is only about 5 degrees
C above ambient.  I suppose one could take a differential
temperature measurement, but laziness kept me from getting that
fancy....
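
For readers who want to see the watchdog logic spelled out, here is a
minimal sketch in Python rather than 6502 assembly. It is not the code
described in the post; the class and method names are hypothetical, and
it assumes the inactivity timer restarts once power comes back. Only the
two constants (70 minutes of silence, 30 seconds of power off) come from
the description above.

```python
from dataclasses import dataclass
from typing import Optional

INACTIVITY_LIMIT_S = 70 * 60  # from the post: more than 70 minutes without tty activity
POWER_OFF_S = 30              # from the post: relay held off for 30 seconds


@dataclass
class Watchdog:
    """Toy model of the inactivity watchdog; every name here is hypothetical."""
    last_activity: float = 0.0
    relay_off_until: Optional[float] = None  # None means the relay is on (power applied)
    resets: int = 0

    def note_activity(self, now: float) -> None:
        """Record that the monitored tty port showed activity at time `now`."""
        self.last_activity = now

    def tick(self, now: float) -> bool:
        """Advance to time `now` (seconds); return True if power is applied."""
        if self.relay_off_until is not None:
            if now < self.relay_off_until:
                return False  # still inside the 30-second power cut
            # Power is restored. Assumption: the inactivity timer restarts
            # here so the rebooting machine gets a full window.
            self.relay_off_until = None
            self.last_activity = now
        if now - self.last_activity > INACTIVITY_LIMIT_S:
            # No activity for too long: open the relay and count the reset.
            self.relay_off_until = now + POWER_OFF_S
            self.resets += 1
            return False
        return True
```

Because uucico is polled once an hour, a healthy machine should call
note_activity() well inside the 70-minute window, so the relay opens only
when traffic has stopped altogether.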


Bill

bdb@becker.UUCP (Bruce Becker) (01/16/90)

In article <1871@neoucom.UUCP> wtm@neoucom.UUCP (Bill Mayhew) writes:
|
|We get random crashes on our 3b1, neoucom, which serves as our
|gateway to the uucp domain.  Running the stock version 2, we had
|more crashes than we get with the HDB system.  The bugs seem to
|happen if you have more than one uucp running at a time, such as
|ph1 and tty000 together.  The port driver seems to get confused
|while handling the 7201 interrupts and leaves a bollixed address in
|one of the 68010 registers.  Supposedly this problem was fixed in
|3.51d (which it would appear is only available inside of AT&T at
|this moment).  With HDB, you can set Maxuuxqts to 1, which, it
|seems, should prevent multiple uucicos from running.

	I run HDB with multiple uuxqt's, multiple
	uucico's, & all sorts of other stuff at
	the same time - it *never* crashes. I
	tend to reboot every couple of months
	just to dust off the memory chips, but
	it's just being cautious.

	On the other hand, I don't use the OBM,
	because it's a pretty flaky device.

	If you aren't using the internal modem, then
	you've probably got some hardware problem.
	My system used to act up once in a while,
	but it was just a case of reseating some
	chips & cables inside - things have worked
	just fine ever since...

-- 
  ,,,,	 Bruce Becker	Toronto, Ont.
w \$$/	 Internet: bdb@becker.UUCP, bruce@gpu.utcs.toronto.edu
 `/c/-e	 BitNet:   BECKER@HUMBER.BITNET
_/  >_	 "Money is the root of all money" - Adam

rhl@eci386.uucp (Richard Lathwell) (01/17/90)

In article <2277@becker.UUCP> bdb@becker.UUCP (Bruce Becker) writes:
> In article <1871@neoucom.UUCP> wtm@neoucom.UUCP (Bill Mayhew) writes:
> |We get random crashes on our 3b1, neoucom ...
>
> 	uucico's, & all sorts of other stuff at
> 	the same time - it *never* crashes. I
> 	tend to reboot every couple of months
> 	just to dust off the memory chips, but
> 	On the other hand, I don't use the OBM,
> 	because it's a pretty flaky device.

At ECI, we have 3 3b1s: 2 are connected to our gateway (named gate in
the maps) by starlan.  gate has 3.5 meg memory (via a fully populated
Combi card), and three rs232 ports. The built-in rs232 port (tty000)
drives an Apple LaserWriter, the other two are two bidirectional connections
to our 386 that doesn't understand starlan (but the 386 supports all 8
terminals in the office). Gate's On Board Modem is directly wired to Bell
through our AT&T PDS; the 386 has a modem on one of its built-in ports
that it shares with a fax machine.

An IBM PC with a starlan card is in the net and uses all of the 3b1s
as both file servers and print servers. The 386 has an rs232 connection
to another 3b1 (named schiz because it has a DOS-73 coprocessor
that also *never* crashes) as an alternate route to the LaserWriter
via starlan to gate.

Gate's OBM handles about 150 calls per day and has done so for about
three years. When gate crashes (about once a month) we usually find
evidence of a kernel bug - "page fault in kernel", corrupted /etc/wtmp
on a block boundary, etc.

Re: *never*: Schiz made it 207 days before the electronics on its
disk fried. We've replaced the fans on all of the 3b1s as they've
died - overheating because of a dead fan hasn't fried a 3b1 yet -
they go into a reboot cycle: (Ouch! I'm hot! Power off...
Power on ... Reboot Ouch! I'm hot! Power off ... ad infinitum
until someone comes into the office and hears the poor sucker calling
Help! Help!, says "WTF?", turns off the power switch, and after letting
it cool down and running diagnostics, replaces the fan).

In other words, in ECI's collective experience, they *never* fail.
The OBM works fine - it's the path of choice to ECI.

We're running a mixture of HDB and AT&T (Convergent Technologies?)
versions of uucp (and cu, etc.). They both seem to work and I've never
seen an unrecoverable failure ascribed to either.

-- 
RHL                                              - rhl@eci386

jcm@mtune.ATT.COM (John McMillan) (02/02/90)

In article <1990Jan31.170216.27161@eci386.uucp> clewis@eci386.UUCP (Chris Lewis) writes:
:
>Well thanks a whole hell of a lot:

	Your attitude is contagious.

>	1) This isn't RS232, this is OBM - different driver.  The only
>	   thing I have on RS232 is a 300 baud diablo printer and the 3b1
>	   has *never* crashed during printing.

	Reconsider.  Only the phone-line polling and call setup
	software differ.  Your OBM is fed from the On Board RS232
	[sans line-drivers]!  Their shared hardware was so
	intertwined it took years to find some of the insidious
	bugs resulting from the sharing.
	
	[Sorry, much as we'd like, we can't take credit for writing
	the original drivers: we were too busy planning to screw up
	your phone network calls -- ref: below.]

>	2) I've been asking this question every other month for quite a bit
>	   longer than you've been posting to this newsgroup, and have *never*
>	   gotten any response other than vague references to the power supply.
>	   (It can't be the power supply, because the machine has been completely
>	   replaced and is still exhibiting the same problem at the same
>	   frequency, *AND* none of our other 6 machines have ever panicked
>	   in this fashion - some of which UUCP more per day than ecicrl
>	   does.  They're all the same version of the O/S.  So, it must be
>	   something environmental).

	Well... this shatters me.  While I've worked hard at answering
	questions though lacking adequate data, I've a ways to go.
	'Seems to me this is the 1st time I've caught references to
	your DUART activity.  [OK, I admit it: I stopped curling up
	with your notes at night, and I don't recall them all.]

:
>	5) It's worthy of note that people are *still* suggesting power supply
>	   problems - in particular, people as knowledgeable as John Milton...
>	   So it ain't all that well known.

	As I stated: the problem is well understood and well documented.
	In the "brief" time -- compared to your contributions --
	that I've been posting to this group, I've described the
	problem several times.  I've also described it in technical
	conversations and memos within AT&T.

	If you present "crashes during DUART activities" to a sober,
	knowledgeable 3B1-support person, they should recognize the
	strong possibility of illegally interleaved command sequences
	to a DUART chip.
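
To make "illegally interleaved command sequences" concrete, here is a toy
model in Python. It is not the 3B1 driver and not an exact model of the
serial chip: the ToyDuart class, its eight registers, and the rule that the
pointer resets to 0 after each data write are all assumptions made for
illustration. It shows only the general hazard, where a two-step
select-then-write sequence is preempted by a second sequence to the same
chip.

```python
class ToyDuart:
    """Hypothetical chip: a pointer register selects where the next data write lands."""

    def __init__(self):
        self.pointer = 0
        self.regs = [0] * 8

    def select(self, reg):
        """Step 1 of a command sequence: choose the target register."""
        self.pointer = reg

    def write(self, value):
        """Step 2: store the value, then reset the pointer (toy-model assumption)."""
        self.regs[self.pointer] = value
        self.pointer = 0


def program(chip, reg, value):
    """Two-step sequence as a generator so a caller can preempt it between steps."""
    chip.select(reg)
    yield  # point at which an interrupt handler could cut in
    chip.write(value)


def run_serialized(sequences):
    """Run each sequence to completion before starting the next (the legal case)."""
    for seq in sequences:
        for _ in seq:
            pass


def run_interleaved(mainline, handler):
    """Let `handler` run in full between the mainline's select and its write."""
    next(mainline)        # mainline selects its register...
    for _ in handler:     # ...the handler runs its whole sequence...
        pass
    for _ in mainline:    # ...and the mainline writes through a stale pointer
        pass
```

In the interleaved case the mainline's value lands in register 0 instead of
the register it selected. A driver would normally avoid this by treating
the whole sequence as a critical section, for example by masking interrupts
around it.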

	I'm *NOT* saying your problem *IS* the DUART problem: just
	that these are the symptoms of it.  I would ALSO consider
	power-supply problems and noise problems -- I would NOT be
	running this machine without Ruby(tm) [or analogous] line-
	conditioners.  I would NOT be running this with an ancient
	kernel that has not benefitted from at least the 3.51 fixes.
	
>	6) It's worthy of note that several very knowledgeable people in AT&T
>	   have been consulted (outside of normal support channels) and *no*
>	   concrete suggestions or suspicions have ever been expressed.

	I'm hardly privy to the details you presented them, but
	I'll take your word they were ultra-knowledgable.  It's
	entirely possible that the Tier-II & Tier-IV staff I've
	explained the problem to have deliberately kept this matter
	a secret.  [After all, THEY don't get a chance to nobble
	the phone network as often as they'd LIKE!  So... they
	take it out in other directions.]

>	7) You might want to ask Lenny what happened to our 3.51 upgrade...
>	   (Though it's not his fault...)

	OK -- 'Fess up, Lenny.  We know you've got it and we're
	showing up tonight to take it or you're gonna regret it!

	Actually, in -- am I right Lenny? -- hundreds [seems like
	thousands] of communications with Lenny, he's failed to
	mention your upgrade.  [You sly bugger, Lenny.]

>American Telephone and Telegraph: one might make similar suggestions regarding
>your company's inability to keep the long distance telephone network running...

	Cute.  Trite, simplistic, and irrelevant... but cute!

>Chris Lewis, Elegant Communications Inc, {uunet!attcan,utzoo}!lsuc!eci386!clewis
>Ferret mailing list: eci386!ferret-list, psroff mailing list: eci386!psroff-list

john mcmillan -- att!mtune!jcm

rjg@nis.mn.org (Robert J. Granvin) (02/03/90)

In article <287@mtune.ATT.COM> jcm@mtune.ATT.COM (John McMillan) writes:
>In article <1990Jan31.170216.27161@eci386.uucp> clewis@eci386.UUCP (Chris Lewis) writes:
>
>>	7) You might want to ask Lenny what happened to our 3.51 upgrade...
>>	   (Though it's not his fault...)
>
>	OK -- 'Fess up, Lenny.  We know you've got it and we're
>	showing up tonight to take it or you're gonna regret it!
>
>	Actually, in -- am I right Lenny? -- hundreds [seems like
>	thousands] of communications with Lenny, he's failed to
>	mention your upgrade.  [You sly bugger, Lenny.]

Really!  Have you guys been holding out on me?  Geez!  After I offered
you a beer and pizza, John... You know these guys for years, and then
you learn something like this... I'm crushed!

I guess I'm not going to name my first born male child Lenny John now.
:-(

>>American Telephone and Telegraph: one might make similar suggestions regarding
>>your company's inability to keep the long distance telephone network running...
>
>	Cute.  Trite, simplistic, and irrelevant... but cute!

Actually, I was very impressed by the high quality and efficiency of
the software work that was done to bring this about.  Very impressive,
very complete and very bullet proof!  The efficiency and speed of such 
a large scale networked application to perform line busying definitely
is a showcase of a product designed to handle so many different
conditions in such a rapid fashion.  I hope to strive to those levels
someday...

Oh, by the way, obligatory :-)'s all over the place for the sarcastically
or humor impaired.

-- 
                                          _________Robert J. Granvin_________
duckint: a non-floating duck.             INTERNET: rjg@nis.mn.org
                                            BITNET: rjg%nis.mn.org@nic.mr.net
                                              UUCP: ...amdahl!bungia!nis!rjg

lenny@icus.islp.ny.us (Lenny Tropiano) (02/04/90)

In article <287@mtune.ATT.COM> jcm@mtune.ATT.COM (John McMillan) writes:
|>In article <1990Jan31.170216.27161@eci386.uucp> clewis@eci386.UUCP (Chris Lewis) writes:

[... mucho left out here ...]
|>
|>>	7) You might want to ask Lenny what happened to our 3.51 upgrade...
|>>	   (Though it's not his fault...)
|>
|>	OK -- 'Fess up, Lenny.  We know you've got it and we're
|>	showing up tonight to take it or you're gonna regret it!
|>
|>	Actually, in -- am I right Lenny? -- hundreds [seems like
|>	thousands] of communications with Lenny, he's failed to
|>	mention your upgrade.  [You sly bugger, Lenny.]
|>
100% correct, John.  The upgrade Chris is talking about is beyond me.  In fact,
I'm not sure what I'm missing here.  All I know is _most_ of the serious 
problems that were "correctable" were addressed in the 3.51m kernel.  We
have a lot to thank for that.  AT&T could have easily canned any efforts 
for supporting software after they MD'd [manufacturer discontinued] the product.
If it wasn't for a few dedicated people on the net, and good cooperation
from the 3B1-hackers at AT&T, we'd still be looking at the bugs in the
3.51a kernel and other utilities found within the FIXDISK 2.0.

Let's be thankful for that folks.

-Lenny
-- 
| Lenny Tropiano            ICUS Software Systems      lenny@icus.islp.ny.us |
| {ames,pacbell,decuac,hombre,sbcs,attctc}!icus!lenny     attmail!icus!lenny |
+------- ICUS Software Systems -- PO Box 1;  Islip Terrace, NY  11752 -------+