barnett@grymoire.crd.ge.com (Bruce Barnett) (01/31/91)
Here is the summary I got. Thanks guys!
From: Tore.Saeter@elab-runit.sintef.no
The standard book I read as a general introduction to distributed
databases was "Distributed Databases: Principles and Systems", Stefano
Ceri, Giuseppe Pelagatti, McGraw-Hill Book Comp., 1985.
-----------------
From: telxon!craign@uunet.uu.net (Craig Nirosky )
Regarding your request for information about two-phase commit, etc.,
I would recommend the "classic":
An Introduction to Database Systems
Volume II
C. J. Date
Addison-Wesley Publishing Company
----------------------
From: Richard Bielak <richieb@bony1.bony.com>
Ha, ha. I guess these guys never heard of Murphy's Law. Anyway, there
was a decent article about TP systems in a recent issue of
"Communications of the ACM". It was the November 1990 issue. The
article was titled "Transaction Processing Monitors".
Another reference would be any database textbook. I personally like
the one by Ullman (sp?) called "Database Systems".
------------
From: sanjay@cs.wisc.edu (Sanjay Krishnamurthi)
Unless you want to get hold of old publications, the best place
to look is any modern text that has a chapter on distributed databases.
The book by Korth would be adequate. If you want more detail
I suggest you look at Ceri & Pelagatti. It has an extensive
bibliography as well.
It is Database System Concepts, by Henry Korth & A. Silberschatz.
ISBN 0-07-044752-7
> Re: [book by Ceri & Pelagatti]
An excellent text, but a little too detailed if you only want to point
out the potential problems.
----------------------------------
From: weems@evax.uta.edu (Bob Weems)
I have taught a graduate-level distributed DB course for the last
five years. I would recommend the book by Ceri and Pelagatti as a
starting point, even though it is dated. It is published by McGraw-Hill.
The recent text by Tamer Ozsu, published by Prentice-Hall, is essentially
a superset of C&P and is current.
As far as hacking together a DDB, it is probably doable within a
relatively small organization, but will sacrifice many features that
will make the "distributed" adjective inappropriate. A few thoughts.
1. Will the data be fully replicated at all sites? This trivializes
query processing and simplifies transaction management if you
were already assuming that some replication was needed.
2. Is full transparency to be supported? Assuming that a relational
   model is being used, a relation may be partitioned by tuples
or columns into units called fragments. Each fragment is
accompanied by an expression called a qualification that describes
its contents. Most vendors/application developers believe
that fragmenting into sets of tuples is more important than
fragmenting by columns. Nonetheless, doing this in a general
sense requires that you are correctly manipulating the
qualifications to minimize the amount of data being operated on.
Simple solutions exist, but these are about as elegant as
scanning an entire DB when one tuple is needed.
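As a concrete illustration of point 2, here is a minimal Python sketch of horizontal fragmentation where each fragment carries a qualification predicate that the query processor uses to prune fragments before touching their data. All relation, fragment, and attribute names here are made up for illustration; no real system works exactly this way.

```python
# Each fragment of a toy EMPLOYEE relation carries a "qualification":
# a predicate describing exactly which tuples it holds.
fragments = {
    "emp_east": {
        "qualification": lambda t: t["region"] == "east",
        "tuples": [
            {"id": 1, "name": "alice", "region": "east"},
            {"id": 2, "name": "bob", "region": "east"},
        ],
    },
    "emp_west": {
        "qualification": lambda t: t["region"] == "west",
        "tuples": [
            {"id": 3, "name": "carol", "region": "west"},
        ],
    },
}

def query(region):
    """Use the qualifications to skip fragments that cannot possibly
    contribute to the answer, instead of scanning every site."""
    probe = {"region": region}  # a stand-in tuple for the selection predicate
    result = []
    for name, frag in fragments.items():
        if not frag["qualification"](probe):
            continue  # fragment pruned without touching its data
        result.extend(t for t in frag["tuples"] if t["region"] == region)
    return result

print(query("east"))  # only emp_east is scanned
```

The point of the sketch is the pruning step: manipulating qualifications correctly is what keeps a query from degenerating into a scan of every fragment at every site.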
3. (Related to 2) After fragmenting your relations according to
anticipated queries and transactions, now you must assign copies
to appropriate sites. In general this can be a monstrous operations
research problem, but if the DB software is expected to get
substantial use, the application developers must be able to understand
how a particular assignment of copies to sites might be beneficial.
The big boys cannot afford to move around copies frequently just
to "get it right".
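To make point 3 concrete: even a crude greedy heuristic for placing copies depends on having access statistics. The sketch below (hypothetical fragment names, made-up frequencies) just puts each fragment at the site that reads it most often; real placement is the monstrous operations-research problem mentioned above, and this is only meant to show why the developers need to see how a given assignment pays off.

```python
# (fragment, site) -> reads per day; purely illustrative numbers.
access_freq = {
    ("emp_east", "ny"): 500, ("emp_east", "la"): 20,
    ("emp_west", "ny"): 30,  ("emp_west", "la"): 400,
}

def place(fragments, sites):
    """Greedy placement: each fragment goes to its busiest site."""
    placement = {}
    for f in fragments:
        placement[f] = max(sites, key=lambda s: access_freq.get((f, s), 0))
    return placement

print(place(["emp_east", "emp_west"], ["ny", "la"]))
# {'emp_east': 'ny', 'emp_west': 'la'}
```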
4. Query processing. If all transactions and queries operate on a
handful of tuples, this may not be an issue. If large sets of tuples
are to be manipulated (as might occur even in SQL), it may be
CRITICAL to use semijoin reduction techniques. Even though a lot
of theory exists on this, at least some heuristic application of
the idea is needed. They should look at all the options that
the IBM R* prototype evaluates. It's time-consuming, but worthwhile.
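The semijoin-reduction idea in point 4 can be sketched in a few lines. Assume (hypothetically) that site A holds ORDERS and site B holds CUSTOMERS: instead of shipping all of ORDERS to B for the join, B first sends A just its set of join-column values, and only the reduced relation travels. The data and names below are invented; the R* optimizer costs many variations of plans like this one.

```python
orders = [  # held at site A
    {"order_id": 1, "cust": 10},
    {"order_id": 2, "cust": 11},
    {"order_id": 3, "cust": 99},  # 99 has no matching customer
]
customers = [  # held at site B
    {"cust": 10, "name": "acme"},
    {"cust": 11, "name": "zenith"},
]

# Step 1: B projects its join column and ships it to A (small message).
cust_ids = {c["cust"] for c in customers}

# Step 2: A applies the semijoin, reducing ORDERS before shipping.
reduced = [o for o in orders if o["cust"] in cust_ids]

# Step 3: only the reduced relation travels to B for the final join.
joined = [dict(o, **c) for o in reduced for c in customers
          if o["cust"] == c["cust"]]
print(len(orders), "orders, but only", len(reduced), "shipped")
```

Whether the semijoin pays off depends on how much it shrinks the shipped relation versus the cost of sending the projection, which is exactly what the cost-based evaluation referred to above decides.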
5. Transaction management. Are they even doing the centralized recovery
in a 1990's kind of way? All of the big boys use a variation of
an undo/redo technique that does not require writing through the
massive cache they are using with the disks. If they are using
a distributed two-phase commit, how are they handling deadlocks?
By building a path-pushing cycle detector? By timeouts?
It will not take care of itself, but for ordinary systems there
should be a straightforward solution (it may not "bite them on the
rear"). I don't recall if you mentioned it, but are they at least
starting from the source code of somebody's centralized DBMS?
Building that is likely to be a 1-2 year project by itself.
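The deadlock question in point 5 comes down to finding cycles in the wait-for graph. Here is a sketch of the detection step in the centralized case; a real path-pushing detector runs the same idea distributed, with sites exchanging path fragments, which this toy version does not attempt. Transaction names are illustrative.

```python
def has_cycle(wait_for):
    """wait_for maps each transaction to the transactions it waits on.
    A cycle in this graph is a deadlock. Standard DFS with three
    colors: WHITE = unvisited, GREY = on the current path, BLACK = done."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {t: WHITE for t in wait_for}

    def visit(t):
        color[t] = GREY
        for u in wait_for.get(t, ()):
            if color.get(u, WHITE) == GREY:
                return True  # back edge: t ... u ... t is a deadlock
            if color.get(u, WHITE) == WHITE and visit(u):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and visit(t) for t in list(wait_for))

print(has_cycle({"T1": ["T2"], "T2": ["T1"]}))  # True: T1 and T2 deadlocked
print(has_cycle({"T1": ["T2"], "T2": []}))      # False: T1 will eventually run
```

The timeout alternative mentioned above avoids maintaining this graph at all: a transaction that waits too long is simply aborted, at the cost of occasionally killing transactions that were not actually deadlocked.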
6. Are they using a relational model? If they are using a navigational
model (Codasyl or IMS-like), they will get clobbered by network
latency if they are going any distance for long transactions.
These are workable on a lightly-loaded LAN, but long-haul links are
simply not as fast as the disk-to-CPU path.
------------------
--
Bruce G. Barnett barnett@crd.ge.com uunet!crdgw1!barnett