[comp.databases] Normalized Databases

tiberio@seismo.UUCP (04/01/87)

At our facility we have a certain large investment in a relational database
design that is not in the highest normal form. We have some folks proposing
that we redesign our database (and no doubt our database interface software)
to a greater or highest normal form. They state we will enjoy the following
qualities when this is done.

1) The amount of space required to store the data is lowest.

2) The ability to update the database is greatest.

3) Explaining the database is simplest.

While I agree with the first point, if disk space is not a problem I
don't see where this is a big win.

I'm not really sure under point two if they mean update the database
or update the database design. We update (or replace in INGRES terms)
our databases very little once they are established.

I disagree with the third point, especially if I have to explain this
twisted mess of relations (56 vs 24) to our mythical "joe" user.

What I am asking you folks in netland is for some guidance. I like the
intellectual concept of normal form relational databases, but I need to
know if there are any good reasons not to go that route. I would also
like to know if there are other advantages that the proposers failed to
mention.  I might add that we seem to have plenty of money for
hardware, but that man hours are at a premium.

Since this group has very little traffic, I don't see why we can't
discuss this on the net, but please feel free to send me E-mail for any
mega-responses. References to texts and/or papers would be welcome.

beno!tiberio
seismo!tiberio

roger@esquire.UUCP (04/01/87)

     Since you seem to be discussing an application which already
is up and running, why "improve" it?  It apparently does
what you need it to do, in a manner not inconsistent with
your resources. I'm thinking of one of our applications
which took about 8 man-years to bring up. We will not make
changes in the database scheme, period.  Simply because we
do not care to have to go hack everything to reflect the new scheme.
Rule 1: IF IT AINT BROKE, DON'T FIX IT.
Rule 2: Academic exercises are for academics.

					     Roger Reid
					     cmcl2!esquire!roger

lien@osu-eddie.UUCP (04/02/87)

Normal form is mainly designed to avoid update anomalies. Let's take a
trivial example. Suppose you design such a relation to store the
grades of a class:

 SSN, HOMEWORK1, HOMEWORK2, MIDTERM, FINAL, AVE

Whenever you update the values of HOMEWORK1, HOMEWORK2, MIDTERM,
or FINAL, you have to update the value in AVE at the same time.
Otherwise, you may get an inconsistent database. A good database
should not assume that every user will follow the rule
carefully. Thus, this database is dangerous.

(Actually, a good database doesn't need AVE since it can be
calculated. This is not a good example.)

The problem in this relation  is that AVE is transitively
dependent on the key (SSN). Normalization  theory tells us that it
is  not in 3NF.  If we can design it in 3NF or higher, 
this anomaly can be avoided. This is the point behind
normalization theory.

I don't have complete knowledge of the subject. Please consult
Date's book or Ullman's book to get a better picture.
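For the record, here is a small sketch of that anomaly, driving SQL from
Python's sqlite3 module (the SSN and scores are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE grades
               (ssn TEXT PRIMARY KEY, homework1 REAL, homework2 REAL,
                midterm REAL, final REAL, ave REAL)""")
con.execute("INSERT INTO grades VALUES ('123-45-6789', 80, 90, 70, 85, 81.25)")

# Updating MIDTERM without also updating AVE leaves the row inconsistent:
con.execute("UPDATE grades SET midterm = 90 WHERE ssn = '123-45-6789'")
row = con.execute("""SELECT ave, (homework1+homework2+midterm+final)/4.0
                     FROM grades WHERE ssn = '123-45-6789'""").fetchone()
print(row)  # stored AVE (81.25) no longer matches the recomputed value (86.25)

# Dropping AVE and deriving it in the query removes the anomaly entirely:
avg = con.execute("""SELECT (homework1+homework2+midterm+final)/4.0
                     FROM grades WHERE ssn = '123-45-6789'""").fetchone()[0]
print(avg)  # 86.25
```

The second query shows why AVE shouldn't be stored at all: a derived value
can never disagree with the values it is derived from.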

Yao-Nan  Lien

-- 
------------------------------------------------------------
Yao-Nan Lien
Department of Computer and Information Science
Ohio State University
2036, Neil Ave. Mall
Columbus, Ohio 43210-1277
Tel 614 292-5236

CSNet : lien@ohio-state.CSNET
Arpa  : lien@ohio-state.arpa
UUCP  : cbosgd!osu-eddie!lien

jeff@rtech.UUCP (04/02/87)

in article <43239@beno.seismo.CSS.GOV>, tiberio@seismo.CSS.GOV (Mike Tiberio) says:
> 
> At our facility we have a certain large investment in a relational database
> design that is not in the highest normal form. We have some folks proposing
> that we redesign our database (and no doubt our database interface software)
> to a greater or highest normal form. They state we will enjoy the following
> qualities when this is done.
> 
> 1) The amount of space required to store the data is lowest.

This is usually true, but not necessarily so.

> 2) The ability to update the database is greatest.
> 
> 3) Explaining the database is simplest.

Sometimes you can run into update problems in unnormalized databases,
because a single relationship can be stored in more than one place.
For example, suppose you are keeping track of parts, suppliers of
parts, and orders from those suppliers, and one of your tables looks like this:

	Supplier | City | Part | Color | Quantity
	-----------------------------------------
	   J1    |  NY  |  p1  | red   |   100
	   J1    |  NY  |  p2  | white |   200
	   S6    |  LA  |  p2  | white |    50
	   L3    |  DC  |  p2  | white |   300
	   L3    |  DC  |  p3  | blue  |   100

Now suppose the supplier J1 moves from NY to SF.  You will have to
update J1's city in two places.

Suppose you want to keep track of a new supplier, R2, who doesn't
currently supply you with anything.  There will have to be a row in
the table in which part, color, and quantity are blank.

Suppose supplier S6 stops supplying the part p2.  To delete that row from
the table would erase all information about S6, which you don't necessarily
want to do; instead, you must recognize the case where there is only one
row left for this supplier, and blank out the part, color, and quantity
for that row.

The problem here is lack of normalization.  Without going into any detail on
what it means, here is the above information in third normal form:

	Supplier Table		Part Table

	Supplier | City		Part | Color
	---------------		------------
	   J1    |  NY		 p1  | red
	   S6    |  LA		 p2  | white
	   L3    |  DC		 p3  | blue

	On-order Table

	Supplier | Part | Quantity
	--------------------------
	   J1    |  p1  |    100
	   J1    |  p2  |    200
	   S6    |  p2  |     50
	   L3    |  p2  |    300
	   L3    |  p3  |    100

Normalization is sometimes explained with a bunch of fancy jargon, but what
it boils down to is this:  Every table should stand for a single thing.  In
the example, the problem with the original table is that it stands for three
things: suppliers, parts, and orders.  Jamming these together into one table
causes problems when we want to update some aspect of one of the things.
A normalized database is easier to understand because the individual tables
are simpler: each one stands for only one thing.  The only thing that may
become harder is figuring out how to join the various tables together;
the main way to fight against this is to make sure that each of your tables
stands for some real thing in the problem you're trying to solve (if one is
not careful, one can come up with tables that don't model the problem, but
instead model the underlying algorithm; it's a good idea to design your
data first, and then figure out your algorithms).

This makes it sound easier than it really is.  Figuring out what your "things"
are can be difficult.
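To make the payoff concrete, here is the normalized schema above in SQL,
driven from Python's sqlite3; each of the three update problems now has a
one-row fix:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE supplier (supplier TEXT PRIMARY KEY, city TEXT);
CREATE TABLE part     (part TEXT PRIMARY KEY, color TEXT);
CREATE TABLE on_order (supplier TEXT, part TEXT, quantity INT);
INSERT INTO supplier VALUES ('J1','NY'), ('S6','LA'), ('L3','DC');
INSERT INTO part VALUES ('p1','red'), ('p2','white'), ('p3','blue');
INSERT INTO on_order VALUES ('J1','p1',100), ('J1','p2',200),
                            ('S6','p2',50), ('L3','p2',300), ('L3','p3',100);
""")

# J1 moves from NY to SF: one row changes, instead of two in the flat table.
con.execute("UPDATE supplier SET city = 'SF' WHERE supplier = 'J1'")

# A new supplier R2 with no orders needs no blank part/color/quantity fields.
con.execute("INSERT INTO supplier VALUES ('R2', 'SF')")

# S6 stops supplying p2: deleting the order does not erase S6 itself.
con.execute("DELETE FROM on_order WHERE supplier = 'S6' AND part = 'p2'")
s6 = con.execute("SELECT city FROM supplier WHERE supplier = 'S6'").fetchone()
print(s6)  # ('LA',) -- S6 survives even with no orders left
```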

> What I am asking you folks in netland is for some guidance. I like the
> intellectual concept of normal form relational databases, but I need to
> know if there are any good reasons not to go that route.

The main argument against normalization is performance.  Normalization
involves splitting unnormalized tables up into smaller tables.  But this
means you will have to join the tables to recreate the relationships in
the original, unnormalized form.  Joins are slower than fetches from single
tables.

In the above example, suppose that you often need to know which parts
are coming from which cities.  In the normalized database, you would have to
join the supplier and the on-order tables.  If this took too much time,
you might decide to unnormalize your database and merge the two tables
into one:

	Supplier-Order Table

	Supplier | City | Part | Quantity
	---------------------------------
	   J1    |  NY  |  p1  |    100
	   J1    |  NY  |  p2  |    200
	   S6    |  LA  |  p2  |     50
	   L3    |  DC  |  p2  |    300
	   L3    |  DC  |  p3  |    100

Some of the update and understanding problems come back, but it will run
faster for the operation we're interested in.
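The parts-by-city query that motivates the merge looks like this against the
normalized tables (a sketch via Python's sqlite3, using the same sample data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE supplier (supplier TEXT PRIMARY KEY, city TEXT);
CREATE TABLE on_order (supplier TEXT, part TEXT, quantity INT);
INSERT INTO supplier VALUES ('J1','NY'), ('S6','LA'), ('L3','DC');
INSERT INTO on_order VALUES ('J1','p1',100), ('J1','p2',200),
                            ('S6','p2',50), ('L3','p2',300), ('L3','p3',100);
""")

# Normalized: which parts come from which cities requires a join...
rows = con.execute("""SELECT DISTINCT o.part, s.city
                      FROM on_order o JOIN supplier s USING (supplier)
                      ORDER BY o.part, s.city""").fetchall()
print(rows)

# ...whereas the merged Supplier-Order table answers it with a
# single-table scan, at the cost of repeating each supplier's city.
```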
-- 
Jeff Lichtman at rtech (Relational Technology, Inc.)
"Saints should always be judged guilty until they are proved innocent..."

{amdahl, sun}!rtech!jeff
{ucbvax, decvax}!mtxinu!rtech!jeff

phgun@violet.berkeley.edu.UUCP (04/02/87)

While I agree that in most cases it's important to at least have
the relations in 3NF, sometimes there's no necessity for it.
Take the example where my relation is 

	 R(NAME, PHONE#, ROOM#)

where NAMEs are unique, each person has his own PHONE#, and more
than one person may be assigned to the same ROOM#. R is not in 3NF,
since ROOM# is transitively dependent on the key NAME
(i.e. NAME->PHONE#, PHONE#->ROOM#, NAME->ROOM#).

By Normalization theory, we want to (non-loss) decompose R into 
R1(NAME, PHONE#) and R2(PHONE#, ROOM#). But this is only if there's
a reason to know the assignment of phones to rooms independent of
who is occupying them. Otherwise I don't see any reason why we
should worry about "losing" information about the location of
phones in rooms if we delete some name from R, if it is  of no
significance to us (e.g. we rip off the line if no one's using
it!). But if we have lots of queries pertaining to "who is in room
X?", or "which room is Y located in?", this decomposition means we
need to always JOIN R1 and R2, which is a performance LOSS.

Note that R cannot be decomposed into R1(NAME, ROOM#) and
R2(ROOM#, PHONE#), because that is a lossy-join decomposition,
even though on the surface it gives us a more efficient way of
executing the 2 queries above (after all, both are in 3NF,
but the joining attribute is not a key in either relation).
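To see the lossy join concretely, here is a sketch in SQL via Python's
sqlite3 (names, extensions, and room numbers invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Original relation: NAME -> PHONE#, PHONE# -> ROOM#.
CREATE TABLE r (name TEXT PRIMARY KEY, phone TEXT, room TEXT);
INSERT INTO r VALUES ('alice','x111','10'), ('bob','x222','10');

-- Lossy decomposition: ROOM#, the join attribute, is a key of neither part.
CREATE TABLE r1 AS SELECT DISTINCT name, room  FROM r;
CREATE TABLE r2 AS SELECT DISTINCT room, phone FROM r;
""")

# Joining the projections back manufactures tuples that were never in R:
n = con.execute("SELECT COUNT(*) FROM r1 JOIN r2 USING (room)").fetchone()[0]
print(n)  # 4 rows, not 2: the join invents (alice, x222) and (bob, x111)
```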

---Himawan Gunadhi
   ARPA: phgun@violet.berkeley.edu
   UUCP: ...!ucbvax!jade!violet!phgun
   BITNET: phgun@ucbviolet.bitnet

billc@blia.UUCP (04/03/87)

In article <43239@beno.seismo.CSS.GOV>, tiberio@seismo.CSS.GOV (Mike Tiberio) writes:
> ... that we redesign our database 
> to a greater or highest normal form. They state we will enjoy the following
> qualities when this is done.
> 
> 1) The amount of space required to store the data is lowest.

This is true.  However, you pay a performance penalty, which I'll talk 
about below.  It's the old space vs. speed business.

> 2) The ability to update the database is greatest.

I'm not at all sure what is meant by this.  I don't think this makes the
schema any easier to change.  You can build in some better integrity 
checking, but I don't want to go into that now, since there are other
options for those who choose non-normalized schemas (schemata?).

> 3) Explaining the database is simplest.

I also strongly disagree.  Try to write a tutorial using a heavily-normalized
schema as the example database!

Normalization is a religious issue.  I know a number of people who strongly
favor it, and it does indeed have its place.  However, I think there are
some good counter-arguments for CERTAIN situations.  (No, I am not
anti-normalization, but I am also not a normal-form worshipper.)

The biggest disadvantage is in processing time.  Modelling has shown that
the stronger the normalization, the poorer the overall performance
of the database.  More normalization means more joins.  

Another disadvantage is that a heavily normalized database can be 
difficult to tune.  There are many ways of keying together two relations.
Some of them can get pretty slow.  It takes more administrative
"cleverness" to handle a heavily-normalized schema.

And now, the good stuff.  Don't blindly stash everything in one giant
table; there are many situations where normalization pays off.
A large amount of fairly redundant information that is infrequently 
accessed is clearly a good candidate for spinning off into its own 
relation.  Normalization also gives richer semantics and greater
flexibility -- you can't do one-to-many relationships without 
normalization.

-- 
W.H.Coffin.  billc@blia.BLI.COM
 Or, if you really like source routing, try ucbvax!{mtxinu|ucsfcgl}!blia!billc
 >> the usual disclaimer about my employer and my wretched opinions. <<
 >> the usual witticisms that swell netnews to ridiculous proportions. <<

eric@hippo.UUCP (04/03/87)

	Higher normal forms are good, but only in a perfect world. It is
true that it will take less disk space (since any data item will only
be stored once), and the inherent design of the database is "cleaner".
It also makes updates easier, since the only replicated data are key
values.

	What they are not telling you is that the penalty can be performance.
Third normal form databases can quickly lead you to multi-table joins
instead of simple retrieves out of a single table.

	Being the veteran of several projects that used relational databases
(sometimes quite large and complex), I can tell you that our usual
strategy is to design the database initially to third normal form
(using ER diagrams, or whatever), and then start looking at where the
bottlenecks might be, and where replication might be in order to achieve
the necessary performance. But it is a lot better to start in third normal
form and work backwards (at least then you understand and document why
you replicated a particular set of data) than to just shoot for a lower
order of normalization to start with.

-- 

					eric
					...!ptsfa!hippo!eric

allbery@ncoast.UUCP (04/04/87)

As quoted from <43239@beno.seismo.CSS.GOV> by tiberio@seismo.CSS.GOV (Mike Tiberio):
+---------------
| At our facility we have a certain large investment in a relational database
| design that is not in the highest normal form. We have some folks proposing
>...
| 3) Explaining the database is simplest.
>...
| While I agree with the first point, If disc space is not a problem I
| don't see where this is a big win.
>...
| I disagree with the third point, especially if I have to explain this
| twisted mess of relations (56 vs 24) to our mythical "joe" user.
+---------------

It *is* easier.  Our database at TDI has very little normalization, being
basically unchanged from its original form as a FilePro-16 (early version)
database, which was not very relational at all.  In particular, the labor
tracking system for factory orders is something like:

	order --->> lineitem
	  \		|
	   \		v
	    ------>  labdesc
		      ^  |
	         ----^   v
	        /	 v
	    routing ->> labordtl
	        \	    ^
	         \	    ^
	          ------>> location

One arrow is a one-to-one, two make a one-to-many.  I also omit the item file
which joins one-to-many with the lineitem and routing tables, and a number of
other relationships.  Normalizing this adds more tables, but the resulting
relationships are MUCH easier to follow.  (I have attempted to chart the
current relationships in our application; talk about spaghetti...)

One thing that must be remembered, however:  Normalization is not ALWAYS
a good idea.  For instance, it can be argued that cities should be
``normalized'' onto states, ditto for ZIP codes; that UPS zones should be
normalized to ZIP codes; and that area codes should be normalized to
states.  Tables expressing these relationships would increase the degree
of normalization, but in practice they are a bit ridiculous.  A practical
example:  an order entry application
can have a customer mailing address, a customer shipping address, and ways
to override these addresses for specific orders (shipping) and invoices
(mailing).  Normalization says that addresses should be stored in a separate
address file in all cases and pointers used from the various tables; this
saves space in the tables.  However, in practice at least the normal
customer mailing address will be kept in the customer file rather than the
address file, as it will be the most commonly used and needs to be available
for ready reference.  (The UNIFY 3.2 Programmer's Manual contains a very
complete discussion on normalization, when to apply it, and when *not* to
use it; it does a far better job than I am....)

++Brando
-- 
 ____   ______________
/    \ / __   __   __ \   Brandon S. Allbery		|      /^\      USS
 ___  | /__> /  \ /  \    aXcess Company		|     A A A  CHALLENGER
/   \ | |    `--, `--,    6615 Center St. #A1-105	|     H V H
|     | \__/ \__/ \__/    Mentor, OH 44060-4101		|     H . H    SEVEN
\____/ \______________/   +1 216 974 9210		|    /  |  \    WHO
________________________________________________________|   /___|___\  DARED
As long as there is the spirit to dare new frontiers, the dream will never die.
cbatt!cwruecmp!ncoast!allbery ncoast!allbery%case.CSNET@relay.CS.NET BALLBERY
	    (UUCP)                      (CSNET/Internet)             (MCIMail)

vollmer@cod.UUCP (04/05/87)

in article <43239@beno.seismo.CSS.GOV>, tiberio@seismo.CSS.GOV (Mike Tiberio) 
says:
> 
> At our facility we have a certain large investment in a relational database
> design that is not in the highest normal form. We have some folks proposing
> that we redesign our database (and no doubt our database interface software)
> to a greater or highest normal form.....

A real world problem I have had with going overboard with normal forms
is that many of the DBMS 4GL application building tools have a
real problem trying to put the whole thing back together again.

You have data safely tucked away in "their own" relations and spend a
lot of time kludging the application tools to make reports and data
entry sessions not look like they are aimed at 1001 tables.

There are real gains from normalization, but the problem is trying
to balance performance and use of the 4GL tools on the design.  That
balance is both the fun and the art of making relational work.

mjr@well.UUCP (04/05/87)

In article <43239@beno.seismo.CSS.GOV> tiberio@seismo.CSS.GOV (Mike Tiberio) writes:
>At our facility we have a certain large investment in a relational database
>design that is not in the highest normal form. We have some folks proposing
>that we redesign our database (and no doubt our database interface software)
>to a greater or highest normal form. They state we will enjoy the following
>qualities when this is done.
> *** etc. ***

Normal form is better for many of the reasons you state, AND because
when you update an attribute, you only have to do it ONCE, as it only
appears in ONE table and not in many...   Also, for each "thing" you are
describing in the real world, even though it may appear in several tables
(say invoices in one table, and invoice lines in another), there is
a clear-cut way to distinguish which rows (tuples) represent a
particular thing in the real world.

The two main justifications for NOT normalizing are 1) Performance
(something you query a hundred times a day, say, may take too long
if parts of it are scattered across many tables), and 2) recoverability
(if you lose your only copy, what happens...).  For an excellent
discussion of these and other aspects of data-base design, see
Daniel Martin's ADVANCED DATABASE TECHNIQUES.  This book is about
Entity-Relationship modeling generally, but it does have several
chapters devoted to general issues such as those mentioned above!
(that's MIT Press, 1986) Or, I happen to be writing a column
on the subject of DBMSs generally in Computer Language Magazine.
Feel free to write to mjr@well or ptsfa!well!mjr, etc...

mjr@well.UUCP (04/06/87)

Well Jeff, you certainly got things stirred up here...
In general, I don't think that normalization is a "religious"
issue; there are sound technical and logical reasons for
using it.  On the other hand, "if it ain't broke don't
fix it" works if you are aware of the problems you can
get into with unnormalized relations, and are prepared to
deal with them when they come up.

Date has been mentioned here, but here are references to
his most famous work (really the best treatment of
the whole of relational theory I've seen):
C.J. Date, "An Introduction to Database Systems",
Addison-Wesley, 1986 (fourth edition), Vol. I; and also
Volume II, which tends to deal with more "real world"
problems after the foundation is laid in Vol. I...
Be well, mjr@well

billc@blia.UUCP (04/14/87)

The point of my earlier posting was simply that there is no formula that
can replace common sense in designing a schema.  There are people who
are religious about normalization in the sense that they believe in
rigorous normalization no matter what the situation.

I suppose it IS easier to explain a normalized database, given a high
level of complexity.  I was thinking, though, about tutorial-level
indoctrination into query languages.  I'll take a concrete example:
it is very hard to explain correlated subqueries
to the poor first-time SQL user.  (And indeed, they are a horrible botch.)
When describing correlated subqueries, I like to start with single-table
examples.  If the example schema is normalized, the simplest "sensible" 
example query may have 2 or 3 joins involved.  This adds a lot more "noise",
and makes the examples a good deal more difficult to comprehend.
Certainly, once a user understands the query language
a somewhat-normalized database can be easier to understand, but again,
the normalization should be pursued with common sense rather than
with formulaic rabidity :-).
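As an illustration, here is one such single-table correlated subquery
(sketched via Python's sqlite3, reusing the parts-on-order data from
earlier in the thread):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE on_order (supplier TEXT, part TEXT, quantity INT);
INSERT INTO on_order VALUES ('J1','p1',100), ('J1','p2',200),
                            ('S6','p2',50), ('L3','p2',300), ('L3','p3',100);
""")

# "Which orders are larger than the average order for the same part?"
# The inner query conceptually re-runs once per outer row, correlated
# through o.part -- no join needed, so the example stays uncluttered.
rows = con.execute("""
    SELECT o.supplier, o.part, o.quantity
    FROM on_order o
    WHERE o.quantity > (SELECT AVG(i.quantity)
                        FROM on_order i
                        WHERE i.part = o.part)
    ORDER BY o.part, o.quantity""").fetchall()
print(rows)  # the two above-average p2 orders: J1's 200 and L3's 300
```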
-- 
W.H.Coffin.  billc@blia.BLI.COM
 Or, if you really like source routing, try ucbvax!{mtxinu|ucsfcgl}!blia!billc
 >> the usual disclaimer about my employer and my wretched opinions. <<
 >> the usual witticisms that swell netnews to ridiculous proportions. <<

mjr@well.UUCP (Matthew Rapaport) (04/18/87)

I wasn't thinking so much about teaching someone a query language
as designing a database in the first place.  Knowing the logical
advantages of normalization, it makes sense to begin that way, and
make deliberate and reasoned decisions NOT to normalize when
appropriate.
Yours truly,
mjr@well