etb@milton.u.washington.edu (Eric Bushnell) (03/02/91)
Has anybody else had their slave rgyd's die mysteriously?
It has happened a few times on 10.3 nodes on both token
ring and ether networks.

The rgyd_error_log reports the following:

1991/02/23. 6:05:27, root.staff.none.1CDEB, Domain/OS kernel(7), revision 10.3, August 22, 1990 3:32:49 pm
RGYD version 1.2, 89/10/06
Unable to rename database (network computing system/Registry Server Replication)

Checkpoint Task
Cannot rename files during checkpoint

No fault information available
?(errlog) no fault information available (Debug/cross process traceback)

1991/02/23. 6:05:28, root.staff.none.1CDEB, Domain/OS kernel(7), revision 10.3, August 22, 1990 3:32:49 pm
RGYD version 1.2, 89/10/06
IOT instruction fault (UNIX/signal)

distinguished_task
DT exiting

Fault Status 09010006: IOT instruction fault (UNIX/signal)
User Fault PC 3B4B1EE2
D0-D3: 00000001 00000008 00000006 00000000
D4-D7: 00000000 00000002 00000000 00000400
A0-A3: 3B1DD888 3C2EB0B0 3B3D00E6 3B37D51C
A4-A7: 3B3D00EC 3B49F588 3B1DF8AC 3B1DD85C
Supervisor ECB 00000000
Supervisor SR 0000
Supervisor PC 00000000
IOT instruction fault (UNIX/signal)

What's a checkpoint? What is it trying to rename, and why?

(Who's on first? 8-) )

Eric Bushnell
UW Civil Engr
etb@zeus.ce.washington.edu
etb@milton.u.washington.edu
pato@apollo.HP.COM (Joe Pato) (03/07/91)
In article <17465@milton.u.washington.edu>, etb@milton.u.washington.edu (Eric Bushnell) writes:
|> Has anybody else had their slave rgyd's die mysteriously?
|> It has happened a few times on 10.3 nodes on both token
|> ring and ether networks.
|>
|> The rgyd_error_log reports the following:
|>
|> RGYD version 1.2, 89/10/06
|> Unable to rename database (network computing system/Registry Server Replication)
|>
|> Checkpoint Task
|> Cannot rename files during checkpoint
|>
|> No fault information available
|> ?(errlog) no fault information available (Debug/cross process traceback)
|> What's a checkpoint? What is it trying to rename, and why?
|>
|> (Who's on first? 8-) )
|>
|> Eric Bushnell
|> UW Civil Engr
|> etb@zeus.ce.washington.edu
|> etb@milton.u.washington.edu

This is a known (and infrequent) problem. We plan to fix this in a future
release.

The checkpoint task wakes up periodically and writes any database changes to
the disk (each change is always written to the stable store update log - a
checkpoint occurs when the actual database state is saved to disk and the
update log is truncated). During the checkpoint a form of 2PC (two phase
commit) is used. All the database files are first written as foo.new and then
when the database is completely on disk, the file names are changed so that
the new version becomes current. Finally the previous version of the files
(now named .bak) are removed since they are obsolete.

For some reason the rename() call will sometimes yield an error and the server
interprets this as a catastrophic error. We have never seen any damage due to
this error - the server will properly clean up the database files the next
time it is restarted.

-- Joe Pato
Cooperative Computing Division
Hewlett-Packard Company
pato@apollo.hp.com
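
The checkpoint sequence described above (write every file as foo.new, rename
the new files into place once they are all on disk, truncate the update log,
remove the .bak files) can be sketched in Python roughly as follows. This is
only an illustration under assumptions, not the rgyd implementation: the
function name checkpoint, the tables dictionary, and the update.log file name
are all hypothetical, and the exact ordering of steps inside rgyd is not known
from this thread.

```python
import os


def checkpoint(db_dir, tables, log_name="update.log"):
    """Save `tables` (file name -> bytes) into db_dir via .new/.bak renames.

    Hypothetical sketch; names and layout are assumptions, not rgyd's.
    """
    # Step 1: write the complete new state next to the current files.
    for name, data in tables.items():
        new_path = os.path.join(db_dir, name + ".new")
        with open(new_path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # force the bytes to disk before any rename

    # Step 2: everything is on disk, so make the new version current.
    for name in tables:
        cur = os.path.join(db_dir, name)
        if os.path.exists(cur):
            os.replace(cur, cur + ".bak")  # previous version kept as .bak
        os.replace(cur + ".new", cur)      # new version becomes current

    # Step 3: the saved state now covers every logged change; truncate the log.
    with open(os.path.join(db_dir, log_name), "wb"):
        pass

    # Step 4: the previous versions are obsolete, so remove them.
    for name in tables:
        bak = os.path.join(db_dir, name + ".bak")
        if os.path.exists(bak):
            os.remove(bak)
```

In this sketch no rename happens until every .new file is fully written, so
at any point each file still exists on disk as a complete old or new version.
A failed rename surfaces as an OSError from os.replace; a caller could choose
to log it and retry at the next checkpoint instead of treating it as fatal.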
krowitz@RICHTER.MIT.EDU (David Krowitz) (03/07/91)
Joe, I think the "rename" error is a little more frequent than you
suspect ... I have to restart *each* of my 4 rgyd replicas on the
order of about once every 1 to 2 weeks.

== Dave
etb@milton.u.washington.edu (Eric Bushnell) (03/08/91)
In article <9103071456.AA01889@richter.mit.edu> krowitz@RICHTER.MIT.EDU (David Krowitz) writes:
>Joe, I think the "rename" error is a little more frequent than you
>suspect ... I have to restart *each* of my 4 rgyd replicas on the
>order of about once every 1 to 2 weeks.

Agreed. It happens often enough to irritate me. *So far* only
slaves have died like that. The master hasn't choked yet.

Eric Bushnell
UW Civil Engr
etb@zeus.ce.washington.edu
pato@apollo.HP.COM (Joe Pato) (03/09/91)
In article <17952@milton.u.washington.edu>, etb@milton.u.washington.edu (Eric Bushnell) writes:
|> In article <9103071456.AA01889@richter.mit.edu> krowitz@RICHTER.MIT.EDU (David Krowitz) writes:
|> >Joe, I think the "rename" error is a little more frequent than you
|> >suspect ... I have to restart *each* of my 4 rgyd replicas on the
|> >order of about once every 1 to 2 weeks.
|>
|> Agreed. It happens often enough to irritate me. *So far* only
|> slaves have died like that. The master hasn't choked yet.
|>
|> Eric Bushnell
|> UW Civil Engr
|> etb@zeus.ce.washington.edu

Thanks, this is useful information. This has not been a high priority problem
since we have gotten few complaints - and the complaints have generally
referred to an MTBF of ~1 month.

-- Joe Pato
Cooperative Computing Division
Hewlett-Packard Company
pato@apollo.hp.com
marmen@bwdla31.bnr.ca (Rob Marmen 1532773) (03/18/91)
We had this problem for a long time. Finally, at sr10.3 and running the
new version of rgyd (not included at sr10.3 due to an oversight), the registry
is very reliable.

We occasionally have a failure with the registry locking its database.
However, our standard procedure is to just remove /sys/registry/rgy_data
and recreate the slave/master. This occurs about once a month.

We don't view registries as a serious problem anymore. We've tailored our
environment to suit the Apollo's assets and liabilities. HP/Apollo could
improve their image by advertising this combination.

rob...
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
| Robert Marmen           marmen@bnr.ca OR                |
| Bell Northern Research  marmen%bnr.ca@cunyvm.cuny.edu   |
| (613) 763-8244          My opinions are my own, not BNRs |