[unix-pc.uucp] 3.51a keeps dying; even HDB is a mess; any ideas out there?

kevin@kosman.UUCP (Kevin O'Gorman) (06/24/88)

I have been having bad results lately, and don't quite know what to blame
it on.  I know there has been a lot of discussion on the unix-pc net lately
about reliability of various configurations, but I have not been paying
enough attention to be aware if there's a consensus.

As an ex-beta test site, I have several versions of the software running
around here, but I am up to rev at the moment with 3.51 OS and utilities,
and 3.51a kernel.  I'm running a pretty recent HDB from THE STORE!

I am distressed because my machine keeps freezing on me.  It seems pretty
clearly related to uucp traffic.  Just about 15 minutes ago, it froze for
the first time while I was watching it.  It was just at the end of an
incoming uucp call (my unix-pc feed, to be exact), and it left something
for cunbatch in /lost+found.  This means that the freezing is costing me
dropped news, which is something I have suspected for some time.

Anyway, in addition to the above, I am running a Telebit Trailblazer Plus
on /dev/tty002 to handle all data calls.  I would even consider getting a
third phone line in here and resurrecting the OBM for some stuff if that
would help.

What I'm looking for is some opinions from you folks who have been paying
attention.  I want to know what configuration is the most reliable.  I can
back off to 3.51 or even 3.50 kernel, I can play with uucp versions, I can
even reactivate the OBM for my 1200 baud traffic if I must (and GTE willing:
I seem to be wired for 6 pairs in this house, thank goodness).

But I want some informed opinions; I know you're out there, there's lots
of talent running around this net.  There ought to be some way to figure
out a configuration that won't keep barfing in the middle of uucp
traffic.

Thanks,


Kevin O'Gorman ( kevin@kosman ) voice: 805-984-8042
  Vital Computer Systems, 5115 Beachcomber, Oxnard, CA  93035

ken@maxepr.UUCP (Ken Brassler) (06/26/88)

In article <422@kosman.UUCP> kevin@kosman.UUCP (Kevin O'Gorman) writes:
>I have been having bad results lately, and don't quite know what to blame
>it on.
>I am distressed because my machine keeps freezing on me.  It seems pretty
>clearly related to uucp traffic.
>It was just at the end of an
>incoming uucp call (my unix-pc feed, to be exact), and it left something
>for cunbatch in /lost+found.

My machine has run continuously and flawlessly for 2 1/2 years. A
few weeks ago I had my first kernel panic, due to a kernel parity
error. I used the reset button to recover.

Later that same day, my machine crashed (froze, locked-up, etc.)
twice while receiving news. The second time, I was using 'rn' at the
same time, and after reset, I found that my .newsrc file now
contained the text of a news article.

Since both crashes occurred while uucico was running, I surmised
that the disk image of uucico had probably been partially
overwritten with garbage during the first kernel panic. I reloaded
new copies of all the uucp executables, uucico, uux, uuxqt, and uucp
from my archives, and the problem immediately disappeared.
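
One way to test that theory before reloading anything is to compare the
installed executables against known-good archived copies.  What follows
is only a sketch under assumptions: the function name and both directory
arguments are hypothetical, and it presumes a trusted copy of each file
exists in some archive directory.

```python
import filecmp
import os


def changed_binaries(installed_dir, archive_dir, names):
    """Return the names whose installed copy differs from the archived copy.

    A name is also reported if either copy is missing.
    Both directory arguments are placeholders chosen by the caller.
    """
    changed = []
    for name in names:
        installed = os.path.join(installed_dir, name)
        archived = os.path.join(archive_dir, name)
        if not (os.path.isfile(installed) and os.path.isfile(archived)):
            changed.append(name)  # cannot compare, so flag it
        elif not filecmp.cmp(installed, archived, shallow=False):
            changed.append(name)  # byte-for-byte contents differ
    return changed
```

Called with the directory holding uucico and a directory of archived
copies, it returns the names worth reloading; an empty list suggests the
disk images match the archive.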

Personally, I think that kernel crashes are due to a dram address,
where the kernel is loaded, missing a refresh cycle, or being hit by
a cosmic ray (not completely a joke). If no damage was done to the
files on the hard disk during the crash, you can get by with a
reset. If crashes increase in frequency, I think it's time to
reformat and reload the hard disk.
-- 

Ken Brassler {ihnp4|qantel|pyramid|lll-crg}!pacbell!maxepr!ken

erict@flatline.UUCP (j eric townsend) (07/05/88)

In article <422@kosman.UUCP>, kevin@kosman.UUCP (Kevin O'Gorman) writes:

> I am distressed because my machine keeps freezing on me.  It seems pretty
> clearly related to uucp traffic.  Just about 15 minutes ago, it froze for
> the first time while I was watching it.  It was just at the end of an
> incoming uucp call (my unix-pc feed, to be exact), and it left something
> for cunbatch in /lost+found.  This means that the freezing is costing me
> dropped news, which is something I have suspected for some time.

> What I'm looking for is some opinions from you folks who have been paying
> attention.  I want to know what configuration is the most reliable.  I can
> back off to 3.51 or even 3.50 kernel, I can play with uucp versions, I can


Okedoke.  I'm still running 3.0, and I *never* have uucico/uucp problems.

Ever, ever ever.  Well, I had them twice, actually, but once was
because I was playing around with uucico and friends. Bad idea. :-)

The one problem that I had was setgetty not finishing up %100.  ie:
the LCKs would be rm'd, but uucico wouldn't finish up and exit.  More
uucicos would start throughout the day, and not exit.  So... I'd
have about 8 uucicos, none of them talking on the line, and no LCK
files.
I killed all the uucicos, but that didn't help, the problem started again.
So I rebooted, and that solved the problem.
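
The symptom described above (uucicos still running, no LCK files left)
can be spotted with a small check.  This is a rough sketch, not a tested
tool: the function name is made up, and it assumes a process listing
whose lines start with the PID and end with the command name, plus a
list of file names taken from whatever directory holds the lock files.

```python
def orphaned_uucicos(ps_lines, lock_names):
    """Return PIDs of uucico processes when no LCK files exist, else [].

    ps_lines: lines of a process listing; PID first, command last (assumed).
    lock_names: file names found in the UUCP lock directory.
    """
    pids = []
    for line in ps_lines:
        fields = line.split()
        # Header lines are skipped because their first field is not a number.
        if len(fields) >= 2 and fields[0].isdigit() and "uucico" in fields[-1]:
            pids.append(int(fields[0]))
    has_lock = any(name.startswith("LCK") for name in lock_names)
    return [] if has_lock else pids
```

A non-empty result means uucicos are running with no lock file to
account for them, which matches the stuck state described here.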

This is after about 1.5 years of unix-pc uptime and constant usage.

Personally, I'd be %100 content with 3.0 if I could have the 3.5
development set and libraries (flexnames, real curses, etc), but....

An idea:  run one machine as the "newsbox" with 3.0 and HDB (HDB under
3.0 is a godsend) and don't use it for much else.  Rogue, maybe, and
a few other things. :-)
-- 
Skate UNIX or go home, boogie boy...
J. Eric Townsend ->uunet!nuchat!flatline!erict smail:511Parker#2,Hstn,Tx,77007
             ..!bellcore!tness1!/

andy@rbdc.UUCP (Andy Pitts) (07/06/88)

This has all been said before but I'll post it again for any new people
who may have missed it.  The standard things that screw up uucp are:

1) The inittab entry.  There MUST be a space preceding the entry for
the tty line in inittab. example:
 ph1:2:respawn:....
^ must have leading space.  If the space is missing setgetty will lock.

2) There is a bug on some versions of the eia/ram cards that causes the
system to crash.  The fix is simple, you just replace a chip.  If you are
using an eia card call the hotline and ask about it.  This bug only affected
V3.50 and later if my memory serves.

3) Also I seem to remember reading that the device driver for tty000 was
brain damaged.  As I recall it will not pass a null (so break won't change
speed) and hardware flow control did not work.  Some or all of this may
have been fixed with the fixdisk, but I don't know.  Perhaps someone else
out there can tell us.  The drivers for the eia cards seem to work however
(if you change the chip).
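
Item 1 can be checked mechanically.  The following is a minimal sketch,
not anything shipped by AT&T or with HDB: it scans inittab text and
reports the entries, from a list of ids you supply, that lack the
leading space.  The function name and the ids passed in are
hypothetical; which entries need the space is for the caller to decide.

```python
def entries_missing_leading_space(inittab_text, tty_ids):
    """Return ids from tty_ids whose inittab line does not start with a space.

    inittab_text: contents of an inittab file as one string.
    tty_ids: ids of the tty entries to check, e.g. ["ph1"].
    """
    missing = []
    for line in inittab_text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        entry_id = line.strip().split(":", 1)[0]
        if entry_id in tty_ids and not line.startswith(" "):
            missing.append(entry_id)
    return missing
```

Feed it the contents of /etc/inittab and the ids of the tty lines; any
id it returns is an entry written without the leading space.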

I hope this helps.
-- 
Andy Pitts andy@rbdc.UUCP  : "The giant Gorf was hit in  one eye  by a stone,
att    \                   : and that eye  turned  inward  so  that it looked
kd4nc   !gladys!rbdc!andy  : into his mind and he died of what he saw there."
pacbell/                   :   --_The Forgotten Beast of Eld_, McKillip--

rjg@sialis.mn.org (Robert J. Granvin) (07/08/88)

In article <531@rbdc.UUCP> andy@rbdc.UUCP (Andy Pitts) writes:
>This has all been said before but I'll post it again for any new people
>who may have missed it.  The standard things that screw up uucp are:
>
>1) The inittab entry.  There MUST be a space preceding the entry for
>the tty line in inittab. example:
> ph1:2:respawn:....
>^ must have leading space.  If the space is missing setgetty will lock.
>
>2) There is a bug on some versions of the eia/ram cards that causes the
>system to crash.  The fix is simple, you just replace a chip.  If you are
>using an eia card call the hotline and ask about it.  This bug only affected
>V3.50 and later if my memory serves.
>
>3) Also I seem to remember reading that the device driver for tty000 was
>brain damaged.  As I recall it will not pass a null (so break won't change
>speed) and hardware flow control did not work.  Some or all of this may
>have been fixed with the fixdisk, but I don't know.  Perhaps someone else
>out there can tell us.  The drivers for the eia cards seem to work however
>(if you change the chip).

The tty000 driver is broken in at least version 3.51.  It is also
broken in 3.5, if I recall correctly.  The inability for it to pass a
NUL (BREAK) is correct.  This also has the added benefit
that if you pass (or attempt to pass) a BREAK through the OBM, you
will, or may, hang any device on tty000.  The device on tty000 will
sometimes, though not always, recover itself later.  3.51a resolves
this problem.

The chip replacement for the EIA/RAM boards works like a charm, but if
you've still got an old one, it's in your best interest to get a
replacement chip NOW.  It's not unlikely that these things will become
very scarce in a short time (much like clock replacement batteries are
today).

Re: other messages about uucico crashing systems, etc:

Hardware Flow Control works, but is broken.  HFC will consistently
repeat a block of data in an entirely predictable way.  The problem
has been reported to ATT, and has also been sent up a level.  This
escalation in priority (which matches that given to the 3.51a kernel
problems) means that it's very likely (though not guaranteed) that
this bug will be resolved either in the next fixdisk, or a future
one.

3.51 has a broken uucico and ph.  These are repaired by the 3.51a
fixdisk.  3.5 is also broken.

The result is that completion of _some_ connections (there is
little consistency in which connection will be affected) will cause a
system hang, and often a panic.  You are forced to go for the little
black reset button.

ATT claims that ph is the primary culprit, but experience says that
uucico is really primarily to blame, although both are broken.
Primarily based on the number of calls and demands they were getting,
they released the 3.51a uucico on an unannounced request-only basis
several months before the 3.51a fixdisks were released.  In (nearly)
every case, replacing the old uucico with this one completely solved
all panic/crash problems.

Now a step further.  When the 3.51a fixdisk was released, all these
problems were resolved, but a new one was introduced.  If you use the
OBM for UUCP connections, you will _still_ get occasional system
panics with a kernel fault.  If you do _not_ use the OBM, this problem
goes away.  The problem has been traced directly to the
3.51a unix kernel.  The bug has been identified, and it has been verified
that it does not exist in previous kernels.  The problem is caused by
the OBM not correctly closing its last "physical buffer".  It has
been fixed, and the new kernel (3.51b) is currently in testing.  The
fixdisk does not yet exist, so don't call and ask for it.  I'll post a
note when I know it's available.

The 3.51b fixdisk may resolve several other issues as well.  The
exact contents are not known.
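
To keep the version-by-version picture straight, the reports above can
be written down as a small lookup.  This only restates what is said in
this note; the table name, the function name, and the wording of each
entry are my own shorthand, and the 3.5 tty000 entry rests on
recollection.

```python
# Each entry: (reported issue, True if it only occurs when the OBM is
# used for UUCP).  Taken from the text of this note, nothing more.
REPORTED_ISSUES = {
    "3.5": [
        ("tty000 driver cannot pass a BREAK (from memory)", False),
        ("uucico and ph can hang or panic the system when a call ends", False),
    ],
    "3.51": [
        ("tty000 driver cannot pass a BREAK", False),
        ("uucico and ph can hang or panic the system when a call ends", False),
    ],
    "3.51a": [
        ("kernel panic on UUCP connections through the OBM", True),
    ],
}


def reported_issues(version, uses_obm):
    """Return the issues reported for a version, given whether the OBM carries UUCP."""
    return [text for text, obm_only in REPORTED_ISSUES.get(version, [])
            if uses_obm or not obm_only]
```

Versions missing from the table, 3.51b included, return an empty list,
which means nothing is recorded here and not that they are clean.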

By the way, re: the OBM.  As some have noticed, the OBM firmware will
handshake very happily as an MNP modem.  While it's fully capable of
handshaking MNP, it certainly does not communicate MNP.  If you're
using a Telebit or other MNP capable modem, you must be sure to turn
off MNP abilities for any connection to a 3b1, or your connection will
fail.  By the way, the OBM dialing _out_ will _not_ handshake MNP, so
it looks like someone originally designed it to be an MNP modem, then
that capability was removed.  Unfortunately, it wasn't removed enough.
ATT was surprised.

-- 
"I've been trying for some time to                           Robert J. Granvin
 develop a life-style that doesn't          National Information Systems, Inc.
 require my presence."                                       rjg@sialis.mn.org
    -Garry Trudeau                ...{{amdahl,hpda}!bungia,rosevax}!sialis!rjg

david@ms.uky.edu (David Herron -- One of the vertebrae) (07/09/88)

In article <635@sialis.mn.org> rjg@sialis.mn.org (Robert J. Granvin) writes:

>By the way, the OBM dialing _out_ will _not_ handshake MNP, so
>it looks like someone originally designed it to be an MNP modem, then
>that capability was removed.  Unfortunately, it wasn't removed enough.
>ATT was surprised.

no, I don't buy that story.  The Unix PC was on the market before
MNP ever was.  MNP is *not* very old!
-- 
<---- David Herron -- The E-Mail guy                         <david@ms.uky.edu>
<---- ska: David le casse\*'      {rutgers,uunet}!ukma!david, david@UKMA.BITNET
<----
<---- Today is the yesterday you worried about tomorrow.

rjg@sialis.mn.org (Robert J. Granvin) (07/10/88)

>>By the way, the OBM dialing _out_ will _not_ handshake MNP, so
>>it looks like someone originally designed it to be an MNP modem, then
>>that capability was removed.  Unfortunately, it wasn't removed enough.
>>ATT was surprised.
>
>no, I don't buy that story.  The Unix PC was on the market before
>MNP ever was.  MNP is *not* very old!

The actual story isn't the issue, but the observable facts are.

By the way:  MNP modems have been on the market for several years
(probably longer than many realize), and the 3b1/7300 hasn't
necessarily been on the market as long as some would think (for a
discontinued machine).  There is no reason to believe that the 3b1/7300
could NOT have MNP if they wanted it, unless the timing was just too
close between the two.

I have tested dialing in to a number of 3b1s and 7300s of different ages
and configurations with several modems.  Most of the modems were capable
of MNP class 3, others capable of MNP class 5.

They all reported MNP handshake upon connection, whether you were in
reliable or autoreliable mode.  MNP modems have been on the market for
several years, and the design existed long before they were made
available (of course). 

By comparison, when these modems were set in autoreliable mode, a 3b1
OBM dialing _into_ them would not handshake MNP.  If those modems
were set to reliable mode, no connection could be made, of course.

Checking status registers during a "live" connection verifies that
the modem has indeed successfully and correctly handshaked an MNP
connection.  The only thing I haven't attempted to discover yet is
what level of MNP it will actually handshake at.

The condition was reported to ATT, which they found surprising.  They
tried it out, and like magic, an MNP modem handshook just like that.
Don't ever expect to see a fix, though, it's firmware (unless they for
some reason decide to fix something else in the OBM firmware (highly
unlikely) in which case they'll probably forget to address this as
well... :-)

Now, there is one suggestion for the skeptics:  Try it yourself.  I'm
relatively sure it's not a condition limited only to Minnesota and New
York.  :-)  Your Trailblazer is capable of MNP, so there's a
diagnostic tool.

And for those, there are probably two possibilities of why it'll
handshake: 

1) ATT (and/or Convergent) were possibly working on the protocol
   together, or they had some sort of agreement with Microcom for
   its use.  There is a long list of possibilities of why the modem
   might handshake MNP and yet never utilize it.  The only people who
   would know for sure would most likely be anyone directly associated
   with the original "Safari 4" design team.
2) Pure dumb luck that this chip just happens to understand an MNP
   handshake request and respond to it accordingly.  Not entirely 
   likely.  Heck of a coincidence, though!  :-)

By the way:  I do like to verify my facts before I post them.  It
makes me feel confident that what I post is as close to fact as
possible.  When it comes to possibilities, it's nice to use phrases
like "could be" or "possibly" or "a possible scenario", etc.  When the
true story and reasons are important, then someone will do their best
to find that out.  In this case, _why_ it does it is not important,
but the fact that it does, is.  

This whole thing comes back to the very same suggestion as was in the
previous note:  If you are running as an autoreliable dialout MNP
modem, make sure you turn _off_ all MNP before you try to connect to a
3b1/7300, or you'll get a connect, but that's it.  

'Nuf said (more than :-), I hope.
 
-- 
"I've been trying for some time to                           Robert J. Granvin
 develop a life-style that doesn't          National Information Systems, Inc.
 require my presence."                                       rjg@sialis.mn.org
    -Garry Trudeau                ...{{amdahl,hpda}!bungia,rosevax}!sialis!rjg

mml@magnus.UUCP (Mike Levin) (07/12/88)

In article <9915@g.ms.uky.edu> david@ms.uky.edu (David Herron -- One of the vertebrae) writes:
>no, I don't buy that story.  The Unix PC was on the market before
>MNP ever was.  MNP is *not* very old!

I have been using modems from Microcom (who invented Microcom Networking
Protocol, also known as MNP) for at *least* 6 years, and they weren't
brand-new at that time.  So, I have a hunch that MNP predates the unix-pc.
Of course, I haven't got any idea how long *it's* been on the market.



-- 
+---+  P L E A S E    R E S P O N D   T O: +---+  *  *  *  *  *  *  *  *  *  *
| Mike Levin, Silent Radio Los Angeles (magnus)| I never thought I'd be LOOKING
| Path {csun|kosman|mtune|srhqla}!magnus!levin |    for something to say! ! !
+----------------------------------------------+------------------------------+