[comp.sys.apollo] DN10000 problems

herb@blender.uucp (Herb Peyerl) (12/10/90)

rtaylor@tron.UUCP (Randy Taylor) writes:
[Various symptoms of a DN10k hanging for no reason deleted]

We have a couple of DN10k's, one of which we bought from Mentor.  Up
until just recently, Mentor supported only SR10.0.p, which had
a slight problem with dspst.  Whenever you did a dspst of the
node, it would crash with a status of 40004 (reference to illegal
address).  Could it be that you're running this release of the OS
and your users are dspst'ing the node?

-- 
--------------------------------------------------------------------------
UUCP: herb@blender.UUCP   || #define Janitor Administrator
ICBM: 51 03 N / 114 05 W  || Apollo System_Janitor, Novatel Communications
"I spilled spot remover on my dog and now he's gone..." <Steven Wright>

rtaylor@tron.UUCP (Randy Taylor) (12/11/90)

I have a weird situation with my DN10k and the posting listed below may
be part (or all) of the solution :

::FYI:
::  DN 10000 owners:
::   We have experienced some severe power supply shutdown problems
::on our 10K.  Symptoms are:
::   1) System loses main power.
::   2) Standby lamp on front panel lit
::   3) Error status 30 (HEX) on EuroCard power supply plugged into
::      the bottom of the X bus.
::
::   There is a SERVICE NOTE on the +5 V portion of the 10k
::power supplies dated 18 June 1990.  A summary of the text follows:
::
::DN100X0/DSP100X0
::Serial Numbers: All
::
::Date Code: All +5 Volt Booster Modules with 1988 Date Codes
::Performed By: HP/Apollo Qualified Service Personnel Only
::Parts Required: +5 Volt, 150W Control Module (APN 010524-001)
::
::Situation:
::A problem has been identified with the DN100X0/DSP100X0 power
::systems in both Manufacturing and the Field. The power system shuts down
::due to a +5 Volt OV (Over Voltage) failure.
::
::Having evaluated several Power EuroCards from Manufacturing and
::returns from the field, R&D has identified an oscillation on some of the 
::+5 Volt Booster DC/DC Converters.  This oscillation forces the +5V
::output voltage to exceed +5.3V dc and the microprocessor shuts down
::the power system.
::
::After having tested different +5 Volt Booster Module configurations,
::R&D has concluded that Booster Modules with 1988 Datecodes are the
::direct cause of the +5 Volt OV (Over Voltage) failures.
::
::                             -jjw
::waldram@grizzly.uwyo.edu
::jwaldram@outlaw.uwyo.edu
::jwaldram@UWYO.BITNET

(1)
My DN10k hangs up from time to time, i.e. if I try to crp to it, the shell
hangs until I try to pst or ld it from another shell - then the crp attempt
unhangs.

(2)
Disk access time slows down dramatically for no apparent reason whatsoever.
This problem is periodic and unpredictable.

(3)
The DN10k crashes frequently. Afterwards, the orphans can be numerous - in
one case, the dn10k lost track of 106 files ! :-()
                
(4)
Finally, the other day, the DN10k shut down with the symptoms noted
in the posting.

I had thought that it was a combination of hardware and software, but from
the above posting, it looks like it may be the power supply.

Has anyone else out there had similar problems ?

Thanks in advance,

Randy Taylor
Westinghouse Electric Corporation
Electronic Systems Group
Friendship Site
PO Box 1897 - MS 759
Baltimore, MD 21203
-- 
rtaylor@sky00.bwi.wec.com  from an Internet site (preferred) 
rtaylor@tron.bwi.wec.com   from an Internet site (alternate)

"...you know I have the greatest enthusiasm for the mission." HAL 9000

system@alchemy.chem.utoronto.ca (System Admin (Mike Peterson)) (12/15/90)

In article <677@tron.UUCP> rtaylor@tron () writes:
>I have a weird situation with my DN10k and the posting listed below may
>be part (or all) of the solution :
>
>(1)
>My DN10k hangs up from time to time, i.e. if I try to crp to it, the shell
>hangs until I try to pst or ld it from another shell - then the crp attempt
>unhangs.

We have seen this kind of behaviour for 2 years: rlogin/telnet sessions
are delayed, hang completely, or even lose the connection at random. This
happens almost daily, and the system crashes on average once a week when
TCP services "disappear" although tcpd is still running.
(Our DN10020 is on Ethernet, which is apparently strongly related to the
problem.)

>(2)
>Disk access time slows down dramatically for no apparent reason whatsoever.
>This problem is periodic and unpredictable.

How do/did you determine this? I'd like to test for this somehow,
because I'm pretty sure we see this behaviour when 2 I/O bound jobs
run against one another.

>(3)
>The DN10k crashes frequently. Afterwards, the orphans can be numerous - in
>one case, the dn10k lost track of 106 files ! :-()

To help with this problem, run 'update' from your /etc/rc file - Apollo
will tell you it is not needed, but we have lost very few files in
our many many crashes. Update does a 'sync' every 30 (?) seconds, not
the many minutes that Domain/OS keeps disks up to date.
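
For anyone unfamiliar with what 'update' does, the idea can be sketched
roughly as below. This is only an illustration in Python on a generic
Unix system, not the actual 'update' source and not Domain/OS-specific.
The function name periodic_sync and its parameters are made up for the
example, and the 30-second default is just the figure guessed at above.

```python
import os
import time

def periodic_sync(interval=30.0, iterations=None, sync=None, sleep=time.sleep):
    """Call sync() repeatedly, pausing `interval` seconds between calls.

    iterations=None loops forever, like a daemon; an integer stops after
    that many sync calls. Returns the number of sync calls made.
    The sync and sleep callables are injectable (hypothetical hooks for
    this sketch) so the loop can be exercised without touching the disks.
    """
    if sync is None:
        sync = os.sync          # Unix only, Python 3.3+
    count = 0
    while iterations is None or count < iterations:
        sync()
        count += 1
        # Only sleep if another pass is coming.
        if iterations is None or count < iterations:
            sleep(interval)
    return count
```

Calling periodic_sync() with no arguments would run until killed, which
is roughly what starting 'update' from /etc/rc amounts to.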

>(4)
>Finally, the other day, the DN10k shut down with the symptoms noted
>in the posting.

Time for a new power supply board!
-- 
Mike Peterson, System Administrator, U/Toronto Department of Chemistry
E-mail: system@alchemy.chem.utoronto.ca
Tel: (416) 978-7094                  Fax: (416) 978-8775

rees@pisa.ifs.umich.edu (Jim Rees) (12/15/90)

In article <1990Dec14.192114.5310@alchemy.chem.utoronto.ca>, system@alchemy.chem.utoronto.ca (System Admin (Mike Peterson)) writes:

  To help with this problem, run 'update' from your /etc/rc file - Apollo
  will tell you it is not needed, but we have lost very few files in
  our many many crashes. Update does a 'sync' every 30 (?) seconds, not
  the many minutes that Domain/OS keeps disks up to date.

Actually, it's not so much that it's not needed as that it doesn't really do
what you think it does.  It doesn't write out everything as it would in,
say, Berkeley Unix.  All it does is force-write locked objects.  In the case
of remote objects that's good, since objects are always written out from
local memory when they're unlocked.  In the case of local objects, it's
possible to unlock a recently modified object, then call sync, and the
object won't get written to disk.

But it does help, even though it doesn't absolutely guarantee the integrity
of the disk.  Actually, bsd doesn't guarantee this either, since sync just
schedules writing, it doesn't really do it.  At least Domain/OS sync really
does the write.  To put it another way:  bsd sync is asynchronous; Domain/OS
sync is synchronous.
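
As an aside for systems where sync only schedules the writes: a program
that needs one particular file to be on disk before it carries on can
call fsync on that file, which does not return until the system reports
the data written. A minimal sketch in Python on a generic POSIX system
follows. It says nothing about Domain/OS internals, and write_durably
is a name invented for the example.

```python
import os

def write_durably(path, data):
    """Write bytes to `path` and block until the OS reports them flushed.

    Returns the number of bytes written.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()                # push Python's own buffer to the OS
        os.fsync(f.fileno())     # ask the OS to flush this file to the device
    return len(data)
```

This only covers the one file; it does not replace a system-wide sync.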

Domain/OS has a pair of processes, the purifiers (pids 3 & 4), that
progressively write dirty objects to disk.  I think they only write pages
from unlocked objects.  So the combination of the purifiers and sync should
be sufficient.

rtaylor@tron.UUCP (Randy Taylor) (01/03/91)

Mike -

I tested for the disk access problem this way :

 1) pst - to check for any CPU-intensive stuff like simulations
          and whatnot. 
          
 2) list a dir on any DN10k tree after verifying that nothing other
    than O.S. processes were running.


Granted, this is not a very technical test, but when the problem
arises, a simple listing of a directory will show it up. Normally, the DN10k
is *very* fast, but when it slows down, it runs at the speed of a dying
DN300 - no, that is *not* an exaggeration !
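
If you want numbers rather than a gut feeling, the directory-listing
check above could be timed with something like the following. This is
only a sketch in Python, not what I actually ran. The names
time_listing and is_slow and the factor of 10 are arbitrary choices for
the example, and repeated listings of the same directory may be served
from cache, so the first timing is probably the most telling one.

```python
import os
import time

def time_listing(path, repeats=5):
    """Return the wall-clock seconds taken by each os.listdir(path) call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        os.listdir(path)
        timings.append(time.perf_counter() - start)
    return timings

def is_slow(timings, baseline, factor=10.0):
    """True if the worst timing exceeds `factor` times a known-good baseline."""
    return max(timings) > factor * baseline
```

Record a baseline while the machine is behaving, then run the same
listing from cron and log whenever is_slow() comes back true.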

By the way, I upgraded the DN10k to 10.3.p a couple of weeks ago and I
haven't seen any problems (yet). :-)

Randy Taylor

rtaylor@tron.UUCP (Randy Taylor) (01/03/91)

Herb -

No. We were running 10.1.p. We have since upgraded to 10.3.p and the problem
has not returned (yet :-) !)

Randy Taylor