barnett@grymoire.crd.ge.com (Bruce Barnett) (01/31/91)
Here is the summary I got. Thanks guys!
From: Tore.Saeter@elab-runit.sintef.no
The standard book I read as a general introduction to distributed
databases was "Distributed Databases: Principles and Systems", Stefano
Ceri, Giuseppe Pelagatti, McGraw-Hill Book Comp., 1985.
-----------------
From: telxon!craign@uunet.uu.net (Craig Nirosky )
Regarding your request for information about two-phase commit, etc.,
I would recommend the "classic":
An Introduction to Database Systems
Volume II
C. J. Date
Addison-Wesley Publishing Company
----------------------
From: Richard Bielak <richieb@bony1.bony.com>
Ha, ha. I guess these guys never heard of Murphy's Law. Anyway, there
was a decent article about TP systems in a recent issue of
"Communications of the ACM". It was the November 1990 issue. The
article was titled "Transaction Processing Monitors".
Another reference would be any database textbook. I personally like
the one by Ullman (sp?) called "Database Systems".
------------
From: sanjay@cs.wisc.edu (Sanjay Krishnamurthi)
Unless you want to get hold of old publications, the best place
to look is any modern text that has a chapter on distributed databases.
The book by Korth would be adequate. If you want more detail
I suggest you look at Ceri & Pelagatti. It has an extensive
bibliography as well.
It is Database System Concepts, by Henry Korth & A. Silberschatz.
ISBN 0-07-044752-7
> Re: [book by Ceri & Pelagatti]
An excellent text, but a little too detailed if you only want to point
out the potential problems.
----------------------------------
From: weems@evax.uta.edu (Bob Weems)
I have taught a graduate-level distributed DB course for the last
five years. I would recommend the book by Ceri and Pelagatti as a
starting point, even though it is dated. It is published by McGraw-Hill.
The recent text by Tamer Ozsu, published by Prentice-Hall, is essentially
a superset of C&P and is current.
As far as hacking together a DDB, it is probably doable within a
relatively small organization, but will sacrifice many features that
will make the "distributed" adjective inappropriate. A few thoughts.
1. Will the data be fully replicated at all sites? This trivializes
query processing and simplifies transaction management if you
were already assuming that some replication was needed.
2. Is full transparency to be supported? Assuming that a relational
   model is being used, a relation may be partitioned by tuples
or columns into units called fragments. Each fragment is
accompanied by an expression called a qualification that describes
its contents. Most vendors/application developers believe
that fragmenting into sets of tuples is more important than
fragmenting by columns. Nonetheless, doing this in a general
sense requires that you are correctly manipulating the
qualifications to minimize the amount of data being operated on.
Simple solutions exist, but these are about as elegant as
scanning an entire DB when one tuple is needed.
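As a concrete illustration of point 2, here is a minimal Python sketch of horizontal fragmentation where each fragment carries a qualification predicate that the query processor uses to prune fragments before touching their data. All relation, fragment, and attribute names here are made up for illustration; no real system works exactly this way.

```python
# Each fragment of a toy EMPLOYEE relation carries a "qualification":
# a predicate describing exactly which tuples it holds.
fragments = {
    "emp_east": {
        "qualification": lambda t: t["region"] == "east",
        "tuples": [
            {"id": 1, "name": "alice", "region": "east"},
            {"id": 2, "name": "bob", "region": "east"},
        ],
    },
    "emp_west": {
        "qualification": lambda t: t["region"] == "west",
        "tuples": [
            {"id": 3, "name": "carol", "region": "west"},
        ],
    },
}

def query(region):
    """Use the qualifications to skip fragments that cannot possibly
    contribute to the answer, instead of scanning every site."""
    probe = {"region": region}  # a stand-in tuple for the selection predicate
    result = []
    for name, frag in fragments.items():
        if not frag["qualification"](probe):
            continue  # fragment pruned without touching its data
        result.extend(t for t in frag["tuples"] if t["region"] == region)
    return result

print(query("east"))  # only emp_east is scanned
```

The point of the sketch is the pruning step: manipulating qualifications correctly is what keeps a query from degenerating into a scan of every fragment at every site.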
3. (Related to 2) After fragmenting your relations according to
anticipated queries and transactions, now you must assign copies
to appropriate sites. In general this can be a monstrous operations
research problem, but if the DB software is expected to get
substantial use, the application developers must be able to understand
how a particular assignment of copies to sites might be beneficial.
The big boys cannot afford to move around copies frequently just
to "get it right".
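To make point 3 concrete: even a crude greedy heuristic for placing copies depends on having access statistics. The sketch below (hypothetical fragment names, made-up frequencies) just puts each fragment at the site that reads it most often; real placement is the monstrous operations-research problem mentioned above, and this is only meant to show why the developers need to see how a given assignment pays off.

```python
# (fragment, site) -> reads per day; purely illustrative numbers.
access_freq = {
    ("emp_east", "ny"): 500, ("emp_east", "la"): 20,
    ("emp_west", "ny"): 30,  ("emp_west", "la"): 400,
}

def place(fragments, sites):
    """Greedy placement: each fragment goes to its busiest site."""
    placement = {}
    for f in fragments:
        placement[f] = max(sites, key=lambda s: access_freq.get((f, s), 0))
    return placement

print(place(["emp_east", "emp_west"], ["ny", "la"]))
# {'emp_east': 'ny', 'emp_west': 'la'}
```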
4. Query processing. If all transactions and queries operate on a
handful of tuples, this may not be an issue. If large sets of tuples
are to be manipulated (as might occur even in SQL), it may be
CRITICAL to use semijoin reduction techniques. Even though a lot
of theory exists on this, at least some heuristic application of
the idea is needed. They should look at all the options that
the IBM R* prototype evaluates. It's time-consuming, but worthwhile.
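The semijoin-reduction idea in point 4 can be sketched in a few lines. Assume (hypothetically) that site A holds ORDERS and site B holds CUSTOMERS: instead of shipping all of ORDERS to B for the join, B first sends A just its set of join-column values, and only the reduced relation travels. The data and names below are invented; the R* optimizer costs many variations of plans like this one.

```python
orders = [  # held at site A
    {"order_id": 1, "cust": 10},
    {"order_id": 2, "cust": 11},
    {"order_id": 3, "cust": 99},  # 99 has no matching customer
]
customers = [  # held at site B
    {"cust": 10, "name": "acme"},
    {"cust": 11, "name": "zenith"},
]

# Step 1: B projects its join column and ships it to A (small message).
cust_ids = {c["cust"] for c in customers}

# Step 2: A applies the semijoin, reducing ORDERS before shipping.
reduced = [o for o in orders if o["cust"] in cust_ids]

# Step 3: only the reduced relation travels to B for the final join.
joined = [dict(o, **c) for o in reduced for c in customers
          if o["cust"] == c["cust"]]
print(len(orders), "orders, but only", len(reduced), "shipped")
```

Whether the semijoin pays off depends on how much it shrinks the shipped relation versus the cost of sending the projection, which is exactly what the cost-based evaluation referred to above decides.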
5. Transaction management. Are they even doing the centralized recovery
in a 1990's kind of way? All of the big boys use a variation of
an undo/redo technique that does not require writing through the
massive cache they are using with the disks. If they are using
a distributed two-phase commit, how are they handling deadlocks?
By building a path-pushing cycle detector? By timeouts?
It will not take care of itself, but for ordinary systems there
should be a straightforward solution (it may not "bite them on the
rear"). I don't recall if you mentioned it, but are they at least
starting from the source code of somebody's centralized DBMS?
Building that is likely to be a 1-2 year project by itself.
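The deadlock question in point 5 comes down to finding cycles in the wait-for graph. Here is a sketch of the detection step in the centralized case; a real path-pushing detector runs the same idea distributed, with sites exchanging path fragments, which this toy version does not attempt. Transaction names are illustrative.

```python
def has_cycle(wait_for):
    """wait_for maps each transaction to the transactions it waits on.
    A cycle in this graph is a deadlock. Standard DFS with three
    colors: WHITE = unvisited, GREY = on the current path, BLACK = done."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {t: WHITE for t in wait_for}

    def visit(t):
        color[t] = GREY
        for u in wait_for.get(t, ()):
            if color.get(u, WHITE) == GREY:
                return True  # back edge: t ... u ... t is a deadlock
            if color.get(u, WHITE) == WHITE and visit(u):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and visit(t) for t in list(wait_for))

print(has_cycle({"T1": ["T2"], "T2": ["T1"]}))  # True: T1 and T2 deadlocked
print(has_cycle({"T1": ["T2"], "T2": []}))      # False: T1 will eventually run
```

The timeout alternative mentioned above avoids maintaining this graph at all: a transaction that waits too long is simply aborted, at the cost of occasionally killing transactions that were not actually deadlocked.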
6. Are they using a relational model? If they are using a navigational
model (Codasyl or IMS-like), they will get clobbered by network
latency if they are going any distance for long transactions.
These are workable on a lightly-loaded LAN, but long-haul links are
simply not as fast as the disk-to-CPU path.
------------------
--
Bruce G. Barnett barnett@crd.ge.com uunet!crdgw1!barnett