elliot@alfred.UUCP (Elliot Dierksen) (01/11/90)
I have been having a very annoying problem lately. While my system (3.51a 4MB RAM, HDB uucp, 2224CE0 modem) is talking to my main news feed (386 PC w/ internal hayes comp. modem), it crashes. I come in and DTR is still being sent to the modem. I hit a key to try and unblank the screen and nothing happens. However, it always comes right back up when I hit reset. It only seems to happen when they call me, not when I call them. I don't know if it is just a coincidence that it is always the same system. It is my primary news feed, and we move around 1 MB of data a day. The weird thing is that it isn't consistent. Sometimes I'll go a week with no problems, and this week it has crashed 3 or 4 times?? Help!! EBD -- Elliot Dierksen "I don't care if my lettuce has DDT on it, as long as it's crisp!!" -- Jorma Kaukonen Work) {att,codas}!candi!fang!ebd (407) 660-3377 Home) {peora,uunet,ucf-cs}!tarpit!alfred!elliot (407) 290-9744
wtm@neoucom.UUCP (Bill Mayhew) (01/14/90)
We get random crashes on our 3b1, neoucom, which serves as our gateway to the uucp domain. Running the stock version 2, we had more crashes than we get with the HDB system. The bugs seem to happen if you have more than one uucp running at a time, such as ph1 and tty000 together. The port driver seems to get confused while handling the 7201 interrupts and leaves a bollixed address in one of the 68010 registers. Supposedly this problem was fixed in 3.51d (which it would appear is only available inside of AT&T at this moment). With HDB, you can set Maxuuxqts to 1, which, it seems, should prevent multiple uucicos from running. Nonetheless, we get an occasional crash every week or two. That isn't too awful, considering the 3b1 handles 2 or more megabytes of news and mail on busy days. If that news gets bottled up, it can be a problem, so I built a hardware solution. I looked through my junk box and came up with a 6502, 6522, 2716, 6116, and a 16 character LCD display. I put together a very small computer that watches the tty000 port on the 3b1. If my box doesn't see any activity on the tty port for more than 70 minutes, it turns off a solid state relay for 30 seconds to cut power to the 3b1, forcing a reboot. We set up Poll to make sure that uucico runs once an hour. The 6502 program is only about 1K of assembly level code, and most of that is for running the LCD display. I also wrote in support for input from an overtemp thermistor, but we've been too lazy to disassemble the 3b1 to mount a thermistor on the power supply. I experimented with measuring the temperature of the air coming out of the fan grille, but variations in the room temperature were too great to differentiate between fan failure and ambient variation. I also ruled out using an air pressure sensor as too unreliable. On our 3b1, the air leaving the fan is only about 5 degrees C above ambient. 
I suppose one could take a differential temperature measurement, but laziness kept me from getting that fancy.... Bill
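[Editor's sketch] The watchdog logic Bill describes above (more than 70 minutes of silence on the tty line, then 30 seconds with the relay open) can be pictured roughly as the small state machine below. This is only an illustration in Python, not his 6502 code: the `Watchdog` class, its method names, and the choice to restart the idle timer when power comes back are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

IDLE_LIMIT_S = 70 * 60  # trip after MORE than 70 minutes without tty activity
POWER_OFF_S = 30        # relay is held open (power cut) for 30 seconds


@dataclass
class Watchdog:
    """Hypothetical model of the watchdog box; times are in seconds."""
    last_activity: float = 0.0
    power_off_until: Optional[float] = None  # None means the relay is closed

    def activity(self, now: float) -> None:
        """Record that something was seen on the monitored tty line."""
        self.last_activity = now

    def tick(self, now: float) -> bool:
        """Advance the state machine; return True while power is on."""
        if self.power_off_until is not None:
            if now < self.power_off_until:
                return False  # still inside the 30-second power cut
            # Power restored. Assumption: the idle timer restarts here so
            # the rebooting machine gets a full window before the next trip.
            self.power_off_until = None
            self.last_activity = now
            return True
        if now - self.last_activity > IDLE_LIMIT_S:
            self.power_off_until = now + POWER_OFF_S
            return False
        return True
```

With uucico polled once an hour, a healthy machine calls `activity()` well inside the 70-minute window, so `tick()` keeps returning True; only a hung machine lets the timer run out.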
bdb@becker.UUCP (Bruce Becker) (01/16/90)
In article <1871@neoucom.UUCP> wtm@neoucom.UUCP (Bill Mayhew) writes: | |We get random crashes on our 3b1, neoucom, which serves as our |gateway to the uucp domain. Running the stock version 2, we had |more crashes than we get with the HDB system. The bugs seem to |happen if you have more than one uucp running at a time, such as |ph1 and tty000 together. The port driver seems to get confused |while handling the 7201 interrupts and leaves a boluxed address in |one of the 68010 registers. Supposedly this problem was fixed in |3.51d (which it would appear is only available inside of AT&T at |this moment). With HDB, you can set Maxuuxqts to 1, which it |seems, should prevent multiple uucicos from running. I run HDB with multiple uuxqt's, multiple uucico's, & all sorts of other stuff at the same time - it *never* crashes. I tend to reboot every couple of months just to dust off the memory chips, but it's just being cautious. On the other hand, I don't use the OBM, because it's a pretty flaky device. If you aren't using the internal modem, then you've probably got some hardware problem. My system used to act up once in a while, but it was just a case of reseating some chips & cables inside - things have worked just fine ever since... -- ,,,, Bruce Becker Toronto, Ont. w \$$/ Internet: bdb@becker.UUCP, bruce@gpu.utcs.toronto.edu `/c/-e BitNet: BECKER@HUMBER.BITNET _/ >_ "Money is the root of all money" - Adam
rhl@eci386.uucp (Richard Lathwell) (01/17/90)
In article <2277@becker.UUCP> bdb@becker.UUCP (Bruce Becker) writes: > In article <1871@neoucom.UUCP> wtm@neoucom.UUCP (Bill Mayhew) writes: > |We get random crashes on our 3b1, neoucom ... > > uucico's, & all sorts of other stuff at > the same time - it *never* crashes. I > tend to reboot every couple of months > just to dust off the memory chips, but > On the other hand, I don't use the OBM, > because it's a pretty flaky device. At ECI, we have 3 3b1s: 2 are connected to our gateway (named gate in the maps) by starlan. gate has 3.5 meg memory (via a fully populated Combi card), and three rs232 ports. The built-in rs232 port (tty000) drives an Apple LaserWriter, the other two are two bidirectional connections to our 386 that doesn't understand starlan (but the 386 supports all 8 terminals in the office). Gate's On Board Modem is directly wired to Bell through our AT&T PDS; the 386 has a modem on one of its built-in ports that it shares with a fax machine. An IBM PC with a starlan card is in the net and uses all of the 3b1s as both file servers and print servers. The 386 has an rs232 connection to another 3b1 (named schiz because it has a DOS-73 coprocessor that also *never* crashes) as an alternate route to the LaserWriter via starlan to gate. Gate's OBM handles about 150 calls per day and has done so for about three years. When gate crashes (about once a month) we usually find evidence of a kernel bug - "page fault in kernel", corrupted /etc/wtmp on a block boundary, etc. Re: *never*: Schiz made it 207 days before the electronics on its disk fried. We've replaced the fans on all of the 3b1s as they've died - overheating because of a dead fan hasn't fried a 3b1 yet - they go into a reboot cycle: (Ouch! I'm hot! Power off... Power on ... Reboot Ouch! I'm hot! Power off ... ad infinitum until someone comes into the office and hears the poor sucker calling Help! 
Help!, says "WTF?", turns off the power switch, and after letting it cool down and running diagnostics, replaces the fan). In other words, in ECI's collective experience, they *never* fail. The OBM works fine - it's the path of choice to ECI. We're running a mixture of HDB and AT&T (Convergent Technologies?) versions of uucp (and cu, etc.). They both seem to work and I've never seen an unrecoverable failure ascribed to either. -- RHL - rhl@eci386
jcm@mtune.ATT.COM (John McMillan) (02/02/90)
In article <1990Jan31.170216.27161@eci386.uucp> clewis@eci386.UUCP (Chris Lewis) writes: : >Well thanks a whole hell of a lot: Your attitude is contageous. > 1) This isn't RS232, this is OBM - different driver. The only > thing I have on RS232 is a 300 baud diablo printer and the 3b1 > has *never* crashed during printing. Reconsider. Only the phone-line polling and call setup software differ. Your OBM is fed from the On Board RS232 [sans line-drivers]! Their shared hardware was so intertwined it took years to find some of the insidious bugs resulting from the sharing. [Sorry, much as we'd like, we can't take credit for writing the original drivers: we were too busy planning to screw up your phone network calls -- ref: below.] > 2) I've been asking this question every other month for quite a bit > longer than you've been posting to this newsgroup, and have *never* > gotten any response other than vague references to the power supply. > (It can't be the powersupply, because the machine has been completely > replaced and is still exhibiting the same problem at the same > frequency, *AND* none of our other 6 machines have ever panicked > in this fashion - some of which UUCP more per day than ecicrl > does. They're all the same version of the O/S. So, it must be > something environmental). Well... this shatters me. While I've worked hard at answering questions though lacking adequate data, I've a ways to go. 'Seems to me this is the 1st time I've caught references to your DUART activity. [OK, I admit it: I stopped curling up with your notes at night, and I don't recall them all.] : > 5) It's worthy of note that people are *still* suggesting power supply > problems - in particular, people as knowledgable as John Milton... > So it ain't all that well known. As I stated: the problem is well understood and well documented. In the "brief" time -- compared to your contributions -- that I've been posting to this group, I've described the problem several times. 
I've also described it in technical conversations and memos within AT&T. If you present "crashes during DUART activities" to a sober, knowledgeable 3B1-support person, they should recognize the strong possibility of illegally interleaved command sequences to a DUART chip. I'm *NOT* saying your problem *IS* the DUART problem: just that these are the symptoms of it. I would ALSO consider power-supply problems and noise problems -- I would NOT be running this machine without Ruby(tm) [or analogous] line-conditioners. I would NOT be running this with an ancient kernel that has not benefitted from at least the 3.51 fixes. > 6) It's worthy of note that several very knowledgable people in AT&T > have been consulted (outside of normal support channels) and *no* > concrete suggestions or suspicions have ever been expressed. I'm hardly privy to the details you presented them, but I'll take your word they were ultra-knowledgeable. It's entirely possible that the Tier-II & Tier-IV staff I've explained the problem to have deliberately kept this matter a secret. [After all, THEY don't get a chance to nobble the phone network as often as they'd LIKE! So... they take it out in other directions.] > 7) You might want to ask Lenny what happened to our 3.51 upgrade... > (Though it's not his fault...) OK -- 'Fess up, Lenny. We know you've got it and we're showing up tonight to take it or you're gonna regret it! Actually, in -- am I right Lenny? -- hundreds [seems like thousands] of communications with Lenny, he's failed to mention your upgrade. [You sly bugger, Lenny.] >American Telephone and Telegraph: one might make similar suggestions regarding >your company's inability to keep the long distance telephone network running... Cute. Trite, simplistic, and irrelevant... but cute! >Chris Lewis, Elegant Communications Inc, {uunet!attcan,utzoo}!lsuc!eci386!clewis >Ferret mailing list: eci386!ferret-list, psroff mailing list: eci386!psroff-list john mcmillan -- att!mtune!jcm
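[Editor's sketch] One way to picture what John calls "illegally interleaved command sequences" is a toy chip whose registers are reached in two steps: first select a register, then write a value. The Python model below is purely illustrative. `ToyChip`, its register numbers, and the select-then-write protocol are assumptions made for the example, not the 3B1 driver or the real chip's register map. It shows how an interrupt handler that slips in between the two steps of another sequence sends a value to the wrong register, and how running each pair without interruption avoids that.

```python
class ToyChip:
    """Hypothetical serial chip with a two-step register access protocol."""

    def __init__(self):
        self.selected = 0      # register the next write will land in
        self.regs = [0] * 8    # made-up register file

    def select(self, reg):
        self.selected = reg

    def write(self, value):
        self.regs[self.selected] = value
        self.selected = 0      # assumption: pointer falls back to register 0


def run(ops):
    """Apply a list of (method_name, argument) steps to a fresh chip."""
    chip = ToyChip()
    for name, arg in ops:
        getattr(chip, name)(arg)
    return chip.regs


main_seq = [("select", 3), ("write", 0x11)]  # mainline driver code
irq_seq = [("select", 5), ("write", 0xAA)]   # interrupt handler

# The interrupt arrives between the mainline's select and its write.
interleaved = run([main_seq[0], *irq_seq, main_seq[1]])

# The mainline pair completes before the handler runs (e.g. interrupts masked).
serialized = run([*main_seq, *irq_seq])
```

In the interleaved run the mainline's value ends up in register 0 and register 3 is never written; in the serialized run both values land where they were meant to go.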
rjg@nis.mn.org (Robert J. Granvin) (02/03/90)
In article <287@mtune.ATT.COM> jcm@mtune.ATT.COM (John McMillan) writes: >In article <1990Jan31.170216.27161@eci386.uucp> clewis@eci386.UUCP (Chris Lewis) writes: > >> 7) You might want to ask Lenny what happened to our 3.51 upgrade... >> (Though it's not his fault...) > > OK -- 'Fess up, Lenny. We know you've got it and we're > showing up tonight to take it or you're gonna regret it! > > Actually, in -- am I right Lenny? -- hundreds [seems like > thousands] of communications with Lenny, he's failed to > mention your upgrade. [You sly bugger, Lenny.] Really! Have you guys been holding out on me? Geez! After I offered you a beer and pizza, John... You know these guys for years, and then you learn something like this... I'm crushed! I guess I'm not going to name my first born male child Lenny John now. :-( >>American Telephone and Telegraph: one might make similar suggestions regarding >>your company's inability to keep the long distance telephone network running... > > Cute. Trite, simplistic, and irrelevant... but cute! Actually, I was very impressed by the high quality and efficiency of the software work that was done to bring this about. Very impressive, very complete and very bullet proof! The efficiency and speed of such a large scale networked application to perform line busying definitely is a showcase of a product designed to handle so many different conditions in such a rapid fashion. I hope to strive to those levels someday... Oh, by the way, obligatory :-)'s all over the place for the sarcastically or humor impaired. -- _________Robert J. Granvin_________ duckint: a non-floating duck. INTERNET: rjg@nis.mn.org BITNET: rjg%nis.mn.org@nic.mr.net UUCP: ...amdahl!bungia!nis!rjg
lenny@icus.islp.ny.us (Lenny Tropiano) (02/04/90)
In article <287@mtune.ATT.COM> jcm@mtune.ATT.COM (John McMillan) writes: |>In article <1990Jan31.170216.27161@eci386.uucp> clewis@eci386.UUCP (Chris Lewis) writes: [... mucho left out here ...] |> |>> 7) You might want to ask Lenny what happened to our 3.51 upgrade... |>> (Though it's not his fault...) |> |> OK -- 'Fess up, Lenny. We know you've got it and we're |> showing up tonight to take it or you're gonna regret it! |> |> Actually, in -- am I right Lenny? -- hundreds [seems like |> thousands] of communications with Lenny, he's failed to |> mention your upgrade. [You sly bugger, Lenny.] |> 100% correct, John. The upgrade Chris is talking about is beyond me. In fact, I'm not sure what I'm missing here. All I know is _most_ of the serious problems that were "correctable" were addressed in the 3.51m kernel. We have a lot to be thankful for there. AT&T could have easily canned any efforts for supporting software after they MD'd [manufacturer discontinued] the product. If it wasn't for a few dedicated people on the net, and good cooperation from the 3B1-hackers at AT&T, we'd still be looking at the bugs in the 3.51a kernel and other utilities found within the FIXDISK 2.0. Let's be thankful for that, folks. -Lenny -- | Lenny Tropiano ICUS Software Systems lenny@icus.islp.ny.us | | {ames,pacbell,decuac,hombre,sbcs,attctc}!icus!lenny attmail!icus!lenny | +------- ICUS Software Systems -- PO Box 1; Islip Terrace, NY 11752 -------+