[comp.sys.apollo] rgyd dies occasionally

etb@milton.u.washington.edu (Eric Bushnell) (03/02/91)

Has anybody else had their slave rgyd's die mysteriously?
It has happened a few times on 10.3 nodes on both token
ring and ether networks.

The rgyd_error_log reports the following: 

1991/02/23. 6:05:27, root.staff.none.1CDEB, Domain/OS kernel(7), revision 10.3, August 22, 1990  3:32:49 pm
RGYD version 1.2, 89/10/06
Unable to rename database (network computing system/Registry Server Replication)

Checkpoint Task
Cannot rename files during checkpoint

No fault information available
?(errlog) no fault information available (Debug/cross process traceback)

1991/02/23. 6:05:28, root.staff.none.1CDEB, Domain/OS kernel(7), revision 10.3, August 22, 1990  3:32:49 pm
RGYD version 1.2, 89/10/06
IOT instruction fault (UNIX/signal)

distinguished_task
DT exiting

Fault Status    09010006: IOT instruction fault (UNIX/signal)
User Fault PC   3B4B1EE2
D0-D3:          00000001 00000008 00000006 00000000
D4-D7:          00000000 00000002 00000000 00000400
A0-A3:          3B1DD888 3C2EB0B0 3B3D00E6 3B37D51C
A4-A7:          3B3D00EC 3B49F588 3B1DF8AC 3B1DD85C
Supervisor ECB  00000000
Supervisor SR   0000
Supervisor PC   00000000
IOT instruction fault (UNIX/signal)

What's a checkpoint? What is it trying to rename, and why?

(Who's on first? 8-) )

Eric Bushnell
UW Civil Engr
etb@zeus.ce.washington.edu
etb@milton.u.washington.edu

pato@apollo.HP.COM (Joe Pato) (03/07/91)

In article <17465@milton.u.washington.edu>, etb@milton.u.washington.edu (Eric Bushnell) writes:
|> Has anybody else had their slave rgyd's die mysteriously?
|> It has happened a few times on 10.3 nodes on both token
|> ring and ether networks.
|> 
|> The rgyd_error_log reports the following: 
|> 

|> RGYD version 1.2, 89/10/06
|> Unable to rename database (network computing system/Registry Server Replication)
|> 
|> Checkpoint Task
|> Cannot rename files during checkpoint
|> 
|> No fault information available
|> ?(errlog) no fault information available (Debug/cross process traceback)

|> What's a checkpoint? What is it trying to rename, and why?
|> 
|> (Who's on first? 8-) )
|> 
|> Eric Bushnell
|> UW Civil Engr
|> etb@zeus.ce.washington.edu
|> etb@milton.u.washington.edu

This is a known (and infrequent) problem.  We plan to fix this in a future
release.

The checkpoint task wakes up periodically and writes any database changes
to the disk (each change is always written to the stable store update log - 
a checkpoint occurs when the actual database state is saved to disk and 
the update log is truncated).
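
As a rough illustration of the log-then-checkpoint idea described above, here
is a minimal Python sketch. It is not the rgyd implementation: the class name
TinyStore, the file names db.json and update.log, and the JSON encoding are
all assumptions made for the example. The database file is overwritten in
place purely for brevity.

```python
import json
import os


class TinyStore:
    """Toy key/value store: changes are logged first, checkpoints come later."""

    def __init__(self, directory):
        # Hypothetical file names, chosen only for this sketch.
        self.db_path = os.path.join(directory, "db.json")
        self.log_path = os.path.join(directory, "update.log")
        self.state = {}
        # Recovery: load the last checkpoint, then replay the logged updates.
        if os.path.exists(self.db_path):
            with open(self.db_path) as f:
                self.state = json.load(f)
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    key, value = json.loads(line)
                    self.state[key] = value

    def update(self, key, value):
        # Each change is appended to the update log and forced to disk
        # before it is applied to the in-memory state.
        with open(self.log_path, "a") as f:
            f.write(json.dumps([key, value]) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.state[key] = value

    def checkpoint(self):
        # Save the full database state, then truncate the update log.
        with open(self.db_path, "w") as f:
            json.dump(self.state, f)
            f.flush()
            os.fsync(f.fileno())
        open(self.log_path, "w").close()
```

After a restart, a store built on the same directory sees the checkpointed
state plus any updates logged since the last checkpoint.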

During the checkpoint a form of 2PC (two-phase commit) is used.  All the
database files are first written as foo.new, and then, once the database is
completely on disk, the file names are changed so that the new version
becomes current.  Finally the previous versions of the files (now named
.bak) are removed since they are obsolete.  For some reason the rename()
call will sometimes yield an error, and the server interprets this as a
catastrophic error.
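
The write/rename/remove sequence described above might look roughly like the
following Python sketch. This is an assumption-laden illustration, not the
rgyd source: the function name checkpoint and its arguments are hypothetical,
and only the .new and .bak suffixes come from the description.

```python
import os


def checkpoint(directory, files):
    """Write each file as NAME.new, swap the names, then drop the .bak copies.

    `files` maps a file name to the bytes it should contain.
    """
    # Step 1: write the complete new database next to the current one.
    for name, data in files.items():
        with open(os.path.join(directory, name + ".new"), "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())

    # Step 2: rename current -> .bak, then .new -> current.
    # os.replace raises OSError if a rename fails.
    for name in files:
        current = os.path.join(directory, name)
        if os.path.exists(current):
            os.replace(current, current + ".bak")
        os.replace(current + ".new", current)

    # Step 3: the .bak files are obsolete once every rename has succeeded.
    for name in files:
        bak = os.path.join(directory, name + ".bak")
        if os.path.exists(bak):
            os.remove(bak)
```

In this sketch a failed rename surfaces as an OSError. Whether the caller
treats that as fatal or retries is a design choice; since the old data still
exists under the current name or the .bak name at every point, a retry or a
cleanup at the next start is possible.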

We have never seen any damage due to this error - the server will properly
clean up the database files the next time it is restarted.

                    -- Joe Pato
                       Cooperative Computing Division
                       Hewlett-Packard Company
                       pato@apollo.hp.com


krowitz@RICHTER.MIT.EDU (David Krowitz) (03/07/91)

Joe, I think the "rename" error is a little more frequent than you
suspect ... I have to restart *each* of my 4 rgyd replica's on the
order of about once every 1 to 2 weeks.

== Dave

etb@milton.u.washington.edu (Eric Bushnell) (03/08/91)

In article <9103071456.AA01889@richter.mit.edu> krowitz@RICHTER.MIT.EDU (David Krowitz) writes:
>Joe, I think the "rename" error is a little more frequent than you
>suspect ... I have to restart *each* of my 4 rgyd replica's on the
>order of about once every 1 to 2 weeks.

Agreed. It happens often enough to irritate me. *So far* only
slaves have died like that. The master hasn't choked yet.

Eric Bushnell
UW Civil Engr
etb@zeus.ce.washington.edu

pato@apollo.HP.COM (Joe Pato) (03/09/91)

In article <17952@milton.u.washington.edu>, etb@milton.u.washington.edu (Eric Bushnell) writes:
|> In article <9103071456.AA01889@richter.mit.edu> krowitz@RICHTER.MIT.EDU (David Krowitz) writes:
|> >Joe, I think the "rename" error is a little more frequent than you
|> >suspect ... I have to restart *each* of my 4 rgyd replica's on the
|> >order of about once every 1 to 2 weeks.
|> 
|> Agreed. It happens often enough to irritate me. *So far* only
|> slaves have died like that. The master hasn't choked yet.
|> 
|> Eric Bushnell
|> UW Civil Engr
|> etb@zeus.ce.washington.edu

Thanks, this is useful information.  This has not been a high priority
problem since we have gotten few complaints - and the complaints have
generally referred to a MTBF of ~1 month.

                    -- Joe Pato
                       Cooperative Computing Division
                       Hewlett-Packard Company
                       pato@apollo.hp.com

marmen@bwdla31.bnr.ca (Rob Marmen 1532773) (03/18/91)

We had this problem for a long time. Finally, at sr10.3 with
the new version of rgyd (not included at sr10.3 due to
an oversight), the registry is very reliable.

We occasionally have a failure with the registry locking its
database. However, our standard procedure is to just remove
/sys/registry/rgy_data and recreate the slave/master. This occurs
about once a month.

We don't view registries as a serious problem anymore. We've
tailored our environment to suit the Apollo's assets and liabilities.

HP/Apollo could improve their image by advertising this combination.

rob...

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
| Robert Marmen             marmen@bnr.ca  OR             |
| Bell Northern Research    marmen%bnr.ca@cunyvm.cuny.edu |
| (613) 763-8244         My opinions are my own, not BNRs |