rcr@logixwi.uucp (Rainer Ruppert) (06/24/91)
Hi Networkers,

yesterday a friend of mine, an EDP manager, said: "Last week the salesmen of all the big database companies visited me. Most of them wanted to sell me *two-phase commit* for our new distributed system, but I could not believe that 2PC is really able to make distributed systems recoverable."

At first I remembered Ceri & Pelagatti and Date and told him: be sure, it's safe. But seconds later I wasn't quite as sure as before. Let us suppose the following scenario:

	Three systems a, b, c, where a is the coordinator and b, c are the
	agents managing two tables which are under update operation.

	The coordinator performs phase one and, after completing it, phase
	two. Phase two is completed on b and c, but the final quit (ack) from
	c cannot be received by the coordinator because the network dropped it
	during a millisecond of trouble.

	System c releases its locks, and local actions on that system are
	now able to manipulate the c-local table, while the coordinator a
	rejects the transaction on a and b.

	The database is corrupted - or ???

OK, the probability of losing an ack because of a millisecond network failure is not very high, and normal database systems will try to obtain the ack again, but suppose it remains impossible to transmit the ack: then you have exactly that situation. In my opinion C&P and Date discussed only the distributed database itself, without discussing problems in the underlying network.

Has anyone else thought about this problem? Is there a solution with which my friend could be satisfied?

thanks, rcr
-- 
Rainer Ruppert, Logix GmbH		rcr@logixwi.UUCP
Moritzstr. 50, D-6200 Wiesbaden		...!uunet!mcsun!unido!logixwi!rcr
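The scenario above can be sketched as a small simulation (all names hypothetical), assuming a naive coordinator that treats the lost phase-2 ack as a failure and rolls back the nodes it can still reach - which is exactly what a correct 2PC coordinator must not do:

```python
# Naive (incorrect) coordinator: a lost phase-2 ack triggers a rollback
# on the reachable nodes. All names here are hypothetical.

class Agent:
    def __init__(self, name):
        self.name = name
        self.committed = False

    def commit(self, ack_lost=False):
        self.committed = True      # agent applies the update and releases locks
        return not ack_lost        # the ack may be dropped by the network

def naive_coordinator(agents, lost):
    # Phase 2: tell every agent to commit; collect acks.
    acks = {a.name: a.commit(ack_lost=(a.name in lost)) for a in agents}
    if not all(acks.values()):
        # WRONG: "rejecting the transaction" on the nodes that did ack.
        for a in agents:
            if acks[a.name]:
                a.committed = False
    return {a.name: a.committed for a in agents}

b, c = Agent("b"), Agent("c")
state = naive_coordinator([b, c], lost={"c"})
print(state)   # {'b': False, 'c': True}
```

The final state shows exactly the corruption described: c has committed and released its locks, while b has been rolled back.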
rick@tetrauk.UUCP (Rick Jones) (06/24/91)
In article <1991Jun23.191959.29921@logixwi.uucp> rcr@logixwi.uucp (Rainer Ruppert) writes:
>Hi Networkers,
>
>yesterday a friend of mine, an EDP manager, said: "Last week the salesmen
>of all the big database companies visited me. Most of them wanted to sell
>me *two-phase commit* for our new distributed system, but I could not
>believe that 2PC is really able to make distributed systems recoverable."
>
>At first I remembered Ceri & Pelagatti and Date and told him: be sure,
>it's safe. But seconds later I wasn't quite as sure as before. Let us
>suppose the following scenario:
>
>	Three systems a, b, c, where a is the coordinator and b, c are the
>	agents managing two tables which are under update operation.
>
>	The coordinator performs phase one and, after completing it, phase
>	two. Phase two is completed on b and c, but the final quit (ack) from
>	c cannot be received by the coordinator because the network dropped it
>	during a millisecond of trouble.
>
>	System c releases its locks, and local actions on that system are
>	now able to manipulate the c-local table, while the coordinator a
>	rejects the transaction on a and b.
>
>	The database is corrupted - or ???

The key requirement for reliable two-phase commit (2PC) is that the precommitted state must be stored in non-volatile form on all participating systems. When a system accepts a precommit request, it is _guaranteeing_ that it will not fail to fully commit, under any circumstances, although it is still able to abort if requested. This guarantee must survive the scenario where the system goes down between precommit and final commit, requiring that the local database be rebuilt from transaction logs. The precommitted state is itself logged in the database. Consequently, the transaction coordinator must be able to continue attempting to secure a full commit indefinitely, which means that it, too, stores the transaction state in non-volatile form.
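A minimal sketch of the behaviour Rick describes, with in-memory stand-ins for the durable log and the participants (a real system writes both to stable storage, and the retry loop runs across crashes and reconnects):

```python
# Sketch: coordinator logs the commit decision durably *before* phase 2,
# then retries each participant until every ack arrives. Hypothetical
# interfaces; the dict `log` stands in for a non-volatile log.

class Participant:
    def __init__(self, drop_first_n_acks=0):
        self.prepared = False
        self.committed = False
        self._drops = drop_first_n_acks

    def prepare(self):
        self.prepared = True       # durable promise: can still commit after a crash
        return True

    def commit(self):
        self.committed = True      # idempotent: safe to re-receive after a lost ack
        if self._drops:            # simulate the network dropping the ack
            self._drops -= 1
            return None            # coordinator sees no ack
        return True

def coordinator(participants, log):
    if not all(p.prepare() for p in participants):
        log["decision"] = "abort"
        return "abort"
    log["decision"] = "commit"     # point of no return, written before phase 2
    pending = list(participants)
    while pending:                 # keep retrying until every ack is in
        pending = [p for p in pending if p.commit() is not True]
    return "commit"

log = {}
b, c = Participant(), Participant(drop_first_n_acks=2)
result = coordinator([b, c], log)
print(result, log["decision"])   # commit commit
```

Note that a lost ack never changes the outcome: the decision was logged before phase 2 began, so the coordinator's only move is to retry.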
It must certainly not be susceptible to low-level protocol glitches. 2PC relies on reaching points of no return: a participating system gives up the right to force an abort when it accepts a precommit, and the coordinator cannot request an abort once it has started the full commit sequence. If contact fails at this point, the coordinator simply goes on attempting to reconnect and secure the full commit indefinitely - it has no other choice.

You will probably find that in most implementations, the precommit action performs the database updates, while keeping enough information in the logs to allow the action to be undone if required. Only very occasionally will a successful precommit be followed by an abort request from the coordinator.
-- 
Rick Jones, Tetra Ltd.  Maidenhead, Berks, UK		rick@tetrauk.uucp
Chairman, NICE (Non-profit International Consortium for Eiffel)
Any fool can provide a solution - the problem is to understand the problem
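The precommit-with-undo idea can be sketched as follows (hypothetical interface; a real system writes the undo information to a disk log, not a field):

```python
# Sketch: the participant applies the update at precommit time but keeps
# undo information, so a later abort can restore the old value.

class UndoParticipant:
    def __init__(self, value):
        self.value = value
        self.undo = None

    def prepare(self, new_value):
        self.undo = self.value     # keep enough to undo if the coordinator aborts
        self.value = new_value     # update applied at precommit time
        return True

    def commit(self):
        self.undo = None           # discard undo info; the update is final

    def abort(self):
        self.value = self.undo     # roll back using the logged old value
        self.undo = None

p = UndoParticipant(10)
p.prepare(42)
p.abort()
print(p.value)   # 10
```

This matches Rick's observation: the common case (commit) is cheap, and the rare abort pays the cost of applying the undo record.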
dhepner@hpcuhc.cup.hp.com (Dan Hepner) (06/25/91)
From: rcr@logixwi.uucp (Rainer Ruppert)
>scenario:
>
>	Three systems a, b, c, where a is the coordinator and b, c are the
>	agents managing two tables which are under update operation.
>
>	The coordinator performs phase one and, after completing it, phase
>	two. Phase two is completed on b and c, but the final quit (ack) from
>	c cannot be received by the coordinator because the network dropped it
>	during a millisecond of trouble.
>
>	System c releases its locks, and local actions on that system are
>	now able to manipulate the c-local table, while the coordinator a
>	rejects the transaction on a and b.
>
>	The database is corrupted - or ???

Two-phase commit is absolutely guaranteed to keep distributed databases synchronized. Check a description of two-phase commit a little more closely: once phase 1 is completed, the coordinator logs that fact, and with that log the fate of the transaction is sealed (committed).

In the example posted, the transaction is not rejected by A and B; indeed, this situation is solidly addressed by 2PC. B has no need to be informed of the loss of C. When C does return, a recovery protocol will establish to A that C indeed finished the transaction. Had the comm error happened on the way from A to C, the recovery protocol would have advised C that the transaction had indeed committed.

The only irritant with 2PC is that it is subject to "getting stuck": say B acknowledges the phase 1 commit, and C receives the phase 1 commit but does not respond, perhaps having caught fire, but perhaps just having its phone line down. The 2PC protocol is stuck waiting for C to respond, and cannot time out to a decision without the potential for inconsistency. A system administrator (or automation thereof) may in this case make decisions which override the "stuck" two-phase commit, and such overrides have the potential to create inconsistency.

Dan Hepner
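The recovery exchange Dan describes might look roughly like this (hypothetical names and interfaces; a real protocol runs this over the network against the coordinator's durable log):

```python
# Sketch: after the network heals, a participant that lost contact after
# phase 1 reconciles with the coordinator's logged decision.

def recover(coordinator_log, participant_state):
    """Reconcile a participant that lost contact after phase 1."""
    outcome = coordinator_log["decision"]        # durable: 'commit' or 'abort'
    if participant_state == "prepared":
        return outcome                           # participant finishes per the log
    if participant_state == "committed" and outcome == "commit":
        return "commit"                          # already done; just re-ack
    raise RuntimeError("log/participant disagree: 2PC invariant violated")

# The lost-ack case from the original post: c already committed, a's log
# says commit, so recovery merely re-establishes the ack.
print(recover({"decision": "commit"}, "committed"))   # commit
```

Either way the logged decision wins: a prepared participant learns the outcome, and an already-committed participant simply confirms it.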
dhepner@hpcuhc.cup.hp.com (Dan Hepner) (06/29/91)
Before getting on an airplane I wrote:
>The only irritant with 2PC is that it is subject to "getting stuck",
Which I went on to explain poorly.
After completing phase 1, a subservient node can conceivably lose contact
with the commit coordinator. After proper patience, the administrator
of the subservient machine may decide that the cost of any potential
inconsistency caused by making a unilateral decision is outweighed by
the need to free up resources held by the uncompleted transaction.
Some of the literature calls this a "heuristic decision".
Most formal specifications of two-phase commit anticipate such
an override of the protocol proper, and provide for "someone" to
eventually be notified of what has happened, so long as anyone
checks.
Dan Hepner
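A heuristic decision and its later reconciliation can be sketched as follows (hypothetical interfaces; real systems record the heuristic outcome durably and raise it to an operator):

```python
# Sketch: a prepared participant unilaterally commits to free its locks,
# and records the fact so any mismatch can be reported when the
# coordinator finally reappears.

def heuristic_commit(participant):
    participant["state"] = "heuristic-commit"    # unilateral, after "proper patience"

def reconcile(coordinator_decision, participant):
    if participant["state"] == "heuristic-commit" and coordinator_decision == "abort":
        return "heuristic mixed: report to operator"   # inconsistency must be surfaced
    return "consistent"

p = {"state": "prepared"}
heuristic_commit(p)
print(reconcile("abort", p))    # heuristic mixed: report to operator
```

The guess may turn out to match the coordinator's decision, in which case nothing is lost; only a mismatch produces the inconsistency that "someone" must eventually be told about.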