[comp.databases] Balancing Normalization with Performance

jamesr@cortex.med.jhu.edu (Jim Rosenberg) (11/20/90)

Hello database gurus.

I am building my first database, and am trying to understand how to balance
the benefits of normalization with performance requirements.  I am 
interested in any heuristics or literature on deciding when to scrap
normalization to get a performance kick.

The database will reside on a DECSystem 5400, and will be built using Ingres
(whichever version DEC is supplying run-time libraries for in Ultrix 4.x).  
We have ample memory on the system for a large database, so wasting some
memory with empty domains, for example, really isn't a problem.  I am able 
to create the database in what-I-understand-to-be fourth normal form, and I 
am familiar with the implementation of Ingres (University Ingres at least).  

"Rules" regarding trade - offs between relation size and performance, number
of relations and performance, subqueries vs. searches through large relations,
etc., would be a great help.

An example issue I am trying to answer:
  I have a domain that can have one of a given set of character string values.
  I am thinking of creating a relation that equates each string with
  an integer ID.  The benefits would be less memory used in the relation with
  that domain (since an integer would be stored for each tuple rather than
  an n byte length string), faster comparisons in a qualification (since 
  comparisons on that domain would be between integers and not strings), and
  less chance of typos in the database (since one could not enter a string that
  cannot be translated to an ID by a subquery into the ID relation).  The
  cost, of course, is the subquery required whenever you want to access the
  domain.  Does the answer depend on the length of the strings, etc.?
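
In case it makes the question concrete, here is a rough sketch of what I have
in mind, in generic SQL (all names are made up, and the syntax Ingres actually
accepts may well differ):

    -- Lookup relation: one row per legal string, keyed by an integer ID.
    CREATE TABLE scan_type
    (
        scan_type_id    INTEGER      NOT NULL PRIMARY KEY,
        scan_type_name  VARCHAR(30)  NOT NULL UNIQUE
    );

    -- The main relation stores only the integer ID.
    CREATE TABLE scan
    (
        scan_id         INTEGER NOT NULL PRIMARY KEY,
        scan_type_id    INTEGER NOT NULL REFERENCES scan_type
    );

    -- Getting the string back out costs the join/subquery mentioned above.
    SELECT s.scan_id, t.scan_type_name
    FROM   scan s, scan_type t
    WHERE  s.scan_type_id = t.scan_type_id;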


Thanks in advance for all the help.
Replies by email or to this news group are just fine.

James Rosenberg  (jamesr@cortex.med.jhu.edu)
Div. of Psych. Neuro-Imaging
Johns Hopkins University

randall@informix.com (Randall Rhea) (11/21/90)

In article <1092@cortex.med.jhu.edu> jamesr@cortex.med.jhu.edu (Jim Rosenberg) writes:
>Hello database gurus.
>
>I am building my first database, and am trying to understand how to balance
>the benefits of normalization with performance requirements.  I am 
>interested in any heuristics or literature on deciding when to scrap 
>normalization to get a performance kick.  

This may be the biggest question in database design.  There is no simple
answer for it.  Whether or not to normalize is a question that can 
only be answered on a case-by-case basis.

In general, I have had difficulty with over-normalized database designs.
There are analysts who have read a lot of database books, but have little
experience in actually developing real applications ... these are the folks
that tend to over-normalize.  I find myself needing to write a fairly simple
report, but it turns into a complicated program because data is all over
the place.  As you correctly point out, one gets tired of writing multiple
sub-queries.

I could go on and on about database design theory, but you could find one
of those books that discuss it better than I could.  I will, however,
refer to your specific example:

>An example issue I am trying to answer:
>  I have a domain that can have one of a given set of character string values.
>  I am thinking of creating a relation that equates each string with
>  an integer ID.  The benefits would be less memory used in the relation with
>  that domain (since an integer would be stored for each tuple rather than
>  an n byte length string), faster comparisons in a qualification (since 
>  comparisons on that domain would be between integers and not strings), and
>  less chance of typos in the database (since one could not enter a string that
>  cannot be translated to an ID by a subquery into the ID relation).  The 
>  cost, of course, is the subquery required whenever you want to access the
>  domain.  Does the answer depend on the length of the strings, etc.?

A great example!  It shows how tricky the whole thing can be.  If there were
only 10 possible values for your string, then I would certainly
normalize and use the integer ID that you describe.  However, if there were
500 possible string values, or 5000, then I might just stick the string in
the same table as everything else and skip the integer ID.  If there
were 5000 possible values for your string, it might not be worth the
extra table because:

1. Any programs that need to look at the string would have to perform
a sub-query.  In some report writer languages, this can be quite difficult.

2. You could increase the CPU and I/O resources required by
the application.  This means that your application could run more slowly.

3. Your string table would have to be maintained by somebody.  Keeping
track of 5000 different string values is no trivial task.

If there were only 10 possible string values, I WOULD normalize because:

1. It could cut down on disk space requirements.

2. The string reference table would enforce standards for the strings, e.g.
spellings.

3. The system would be easier to use, since a data entry person would not be
required to type in a string; he could simply choose one from an approved list.

In database design, as in life, there are tradeoffs.

-- 

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Randall Rhea                                          Informix Software, Inc. 
Senior Programmer/Analyst, MIS                    uunet!pyramid!infmx!randall

cdm@gem-hy.Berkeley.EDU (Dale Cook) (11/22/90)

In article <1990Nov20.190315.13505@informix.com>, randall@informix.com (Randall Rhea) writes:
|> In article <1092@cortex.med.jhu.edu> jamesr@cortex.med.jhu.edu (Jim Rosenberg) writes:
|> >Hello database gurus.
|> >
|> >I am building my first database, and am trying to understand how to balance
|> >the benefits of normalization with performance requirements.  I am 
|> >interested in any heuristics or literature on deciding when to scrap 
|> >normalization to get a performance kick.  
|> 
|> This may be the biggest question in database design.  There is no simple
|> answer for it.  Whether or not to normalize is a question that can 
|> only be answered on a case-by-case basis.
|> 
|> In general, I have had difficulty with over-normalized database designs.
|> There are analysts who have read a lot of database books, but have little
|> experience in actually developing real applications ... these are the folks
|> that tend to over-normalize.  I find myself needing to write a fairly simple
|> report, but it turns into a complicated program because data is all over
|> the place.  As you correctly point out, one gets tired of writing multiple
|> sub-queries.
|> 
I think you are confusing design with implementation.  Database _designs_,
IMHO, should always be normalized.  You then examine the application
requirements to determine if it is best to denormalize specific aspects
of your design during implementation.  Typically, these are _performance_
requirements, *not* programmer convenience.  (However, end-user convenience 
is a legitimate requirement).  Remember that the goal of the 
design is to understand your data and represent the data structure *as it
logically exists* so that others can understand it too, along with the
obvious goals of minimizing the impact of future changes and maximizing
data sharability.

While I sympathize with the problems of "data all over the place", it does
not in and of itself justify denormalization (especially for programmers).
I've seen many a database where exactly that has been done, which has made
changing that database a maintenance nightmare.

|> I could go on and on about database design theory, but you could find one
|> of those books that discuss it better than I could.  I will, however,
|> refer to your specific example:
|> 
|> >An example issue I am trying to answer:
|> >  I have a domain that can have one of a given set of character string values.
|> >  I am thinking of creating a relation that equates each string with
|> >  an integer ID.  The benefits would be less memory used in the relation with
|> >  that domain (since an integer would be stored for each tuple rather than
|> >  an n byte length string), faster comparisons in a qualification (since 
|> >  comparisons on that domain would be between integers and not strings), and
|> >  less chance of typos in the database (since one could not enter a string that
|> >  cannot be translated to an ID by a subquery into the ID relation).  The 
|> >  cost, of course, is the subquery required whenever you want to access the
|> >  domain.  Does the answer depend on the length of the strings, etc.?
|> 
|> A great example!  It shows how tricky the whole thing can be.  If there were
|> only 10 possible values for your string, then I would certainly
|> normalize and use the integer ID that you describe.  However, if there were
|> 500 possible string values, or 5000, then I might just stick the string in
|> the same table as everything else and skip the integer ID.  If there
|> were 5000 possible values for your string, it might not be worth the
|> extra table because:
|> 
|> 1. Any programs that need to look at the string would have to perform
|> a sub-query.  In some report writer languages, this can be quite difficult.
                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
So what?  That's why you get the big bucks, isn't it?
|> 
|> 2. You could increase the CPU and I/O resources required by
|> the application.  This means that your application could run more slowly.

What type of application?  Queries may be slower, but updates may be 
significantly faster!  You have to consider the whole system, not the
piece of it you're working on today.  If your data is static, it may
be worth it.  If it's volatile, you'll pay dearly for denormalization.
|> 
|> 3. Your string table would have to be maintained by somebody.  Keeping
|> track of 5000 different string values is no trivial task.

No kidding.  But if you embed these strings in one table, what happens
to the poor slob who needs those strings in another table???  Is (s)he supposed
to replicate this mess in the new table?  So much for disk savings.  And
suppose someone changes the value of one of these strings, but neglects
to change all of the other like values?  Your data is now inconsistent.
Your boss gets one value on his report, and his/her boss gets another on
theirs.  Your boss will want to know why (s)he looked like an a-hole in
the meeting this morning.
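
To make that concrete, here is a made-up sketch (generic SQL, invented names)
of what goes wrong when the same string is embedded in two tables instead of
factored into one place:

    -- Two tables that each carry their own copy of the descriptive string.
    CREATE TABLE order_hdr (order_no INTEGER NOT NULL, region_name VARCHAR(30) NOT NULL);
    CREATE TABLE shipment  (ship_no  INTEGER NOT NULL, region_name VARCHAR(30) NOT NULL);

    INSERT INTO order_hdr VALUES (1, 'Pac. NW');
    INSERT INTO shipment  VALUES (1, 'Pac. NW');

    -- Someone "fixes" the value in one place and forgets the other...
    UPDATE order_hdr SET region_name = 'Pacific Northwest'
    WHERE  region_name = 'Pac. NW';

    -- ...and the two tables now give different answers for the same region.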
|> 
|> If there were only 10 possible string values, I WOULD normalize because:
|> 
|> 1. It could cut down on disk space requirements.

How do we know this?  The number of distinct values is of less importance
than other factors such as the number of tables related to this one and 
data volatility.
|> 
|> 2. The string reference table would enforce standards for the strings, e.g.
|> spellings.

Why would one care about spellings if one had 10 possible values, but not
care if there were 5000?
|> 
|> 3. The system would be easier to use, since a data entry person would not be
|> required to type in a string; he could simply choose one from an approved list.
|> 
|> In database design, as in life, there are tradeoffs.
|> 
There are no tradeoffs in design, only implementation.  Programmer 
convenience is not a legitimate tradeoff, or at least pretty damn low
on the list.  Balance query time against update time.  Consider data
volatility when considering storing summary information.  Try to look ahead in
time.  Will the data be used by other applications?  Will it change?
My point is that denormalization should not be undertaken lightly.  You
should understand the consequences of it, both near and long term.
The benefits to the entire system should be clearly identified.

--- Dale Cook   cdm@inel.gov

cdm@gem-hy.Berkeley.EDU (Dale Cook) (11/22/90)

In article <33265@netnews.upenn.edu>, aaron@grad2.cis.upenn.edu (Aaron Watters) writes:

|> 
|> ALTERNATIVELY there is the phenomenon of the UNJUSTIFIED
|> NORMALIZATION.  Where we make use of an assumption such as
|>    Persons only have one address.
|>    Husband's last name determines Wife's last name.
|> etcetera.  In this case the normalization may DISALLOW information
|> that a user may later want to insert into the database.  This
|> is another issue.			-aaron.

I don't understand what you mean by "unjustified normalization".  In 
fact, I don't think there's any such thing.  One normalizes the 
design, and denormalizes the implementation based on _rational_
decisions about performance, stability, etc.  Your examples above
are examples of incorrect data analysis, and have nothing to do with
normalization.

--- Dale Cook     cdm@inel.gov

zuker@6sigma.UUCP (Hunter Zuker) (11/22/90)

>(Randall Rhea) writes:
>>(Jim Rosenberg) writes:

>>I am building my first database, and am trying to understand how to balance
>>the benefits of normalization with performance requirements.

>This may be the biggest question in database design.  There is no simple
>answer for it.  Whether or not to normalize is a question that can 
>only be answered on a case-by-case basis.

Agreed.  When and where to de-normalize does not always have a clear
answer.

The first step in designing a good database is to model the database.
You can do this with something like Entity-Relationship diagrams; I prefer
the Extended Relational Analysis model.  From there you can
normalize.  (I like third normal form; fourth normal form is generally
considered overkill.)

Once you have a normalized model, that is the time to consider what
might be intelligently de-normalized.

Without normalization the possibility for data loss or data conflicts
increases dramatically.  Other reasons to normalize are to save disk space
and increase *overall* performance. (Integrity checks on un-normalized
databases can really slow things down.)

Of course, normalizing also increases the likelihood that you've correctly
thought through your database, which can only help to make a better
database.

>In general, I have had difficulty with over-normalized database designs.

Agreed.  But I would rather have an over-normalized database than one
that wasn't normalized or one that was not well de-normalized.

>>An example issue I am trying to answer:
>>  I am thinking of creating a relation that equates each string with
>>  an integer ID.

>A great example!  It shows how tricky the whole thing can be.  If there were
>only 10 possible values for your string, then I would certainly
>normalize and use the integer ID that you describe.  However, if there were
>500 possible string values, or 5000, then I might just stick the string in
>the same table as everything else and skip the integer ID.  If there
>were 5000 possible values for your string, it might not be worth the
>extra table because:

>1. Any programs that need to look at the string would have to perform
>a sub-query.  In some report writer languages, this can be quite difficult.

The first sentence is the logical consequence of creating a string table.  
The second sentence is a potential limitation of the DBMS you are using.
Remember not all databases that call themselves relational are relational.

>2. You could increase the CPU and I/O resources required by
>the application.  This means that your application could run more slowly.

The time it takes to do the sub-query needs to be weighed against any
integrity checks that are needed to validate an entry.

>3. Your string table would have to be maintained by somebody.  Keeping
>track of 5000 different string values is no trivial task.

True.  It depends on how static the string table is.  The more dynamic
the table, the less desirable it is.  Someone has to enter the 
table initially in either case.

>If there were only 10 possible string values, I WOULD normalize because:

>1. It could cut down on disk space requirements.

This is true in either case.  If you have a couple of million records,
then a string table of 5000 might not be unreasonable.  It could save
you a few MB or so.  If you only have 100, then a string table of 5000 
would be silly.
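
To put rough, purely illustrative numbers on it: replacing a 20-byte string
with a 4-byte integer in 2,000,000 tuples saves about 2,000,000 x 16 bytes
= 32 MB, while a 5,000-row string table at roughly 25 bytes per row adds
only about 125 KB.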

>2. The string reference table would enforce standards for the strings, e.g.
>spellings.

>3. The system would be easier to use, since a data entry person would not be
>required to type in a string; he could simply choose one from an approved list.

Both 2 and 3 are true regardless of the size of the string table.

Whether to include a zipcode table is equivalent to the above example.
A zipcode uniquely identifies a city and state.  If you have a million
or so addresses in your database, it might make sense to have a zipcode
table.  If you only have a thousand (like we do), then it is probably 
better to include city, state and zip in the address record.
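
In generic SQL the two layouts might look like this (names invented, and
assuming for the sake of the example that zip really does determine city
and state):

    -- Denormalized: city and state repeated in every address row.
    CREATE TABLE address_flat
    (
        person_id INTEGER     NOT NULL,
        street    VARCHAR(40) NOT NULL,
        city      VARCHAR(30) NOT NULL,
        state     CHAR(2)     NOT NULL,
        zip       CHAR(5)     NOT NULL
    );

    -- Normalized: the dependency zip -> (city, state) gets its own table.
    CREATE TABLE zipcode
    (
        zip   CHAR(5)     NOT NULL PRIMARY KEY,
        city  VARCHAR(30) NOT NULL,
        state CHAR(2)     NOT NULL
    );

    CREATE TABLE address
    (
        person_id INTEGER     NOT NULL,
        street    VARCHAR(40) NOT NULL,
        zip       CHAR(5)     NOT NULL REFERENCES zipcode
    );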

>In database design, as in life, there are tradeoffs.

True.

Hunter Zuker
-- 
Hunter Zuker		Six Sigma CASE, Inc.	 13456 SE 27, Suite 210
zuker@6sigma.UUCP       (206) 643-6911           Bellevue, WA 98005-4211

zuker@6sigma.UUCP (Hunter Zuker) (11/27/90)

Aaron Watters writes:

>ALTERNATIVELY there is the phenomenon of the UNJUSTIFIED
>NORMALIZATION.  Where we make use of an assumption such as
>   Persons only have one address.
>   Husband's last name determines Wife's last name.
>etcetera.  In this case the normalization may DISALLOW information
>that a user may later want to insert into the database.  This
>is another issue.			

I would argue that these aren't "UNJUSTIFIED NORMALIZATIONS".  I would
say that the data was not modeled correctly.  

There was probably a time that it would have been an accurate model
to say that a wife's last name is an attribute of the husband.  Of course
now this is not accurate.  Exactly how this should be modeled will
depend on what you are building the database for.  

In many databases it is sufficient to model a person as having only one
address.  In others it may not be.  Once this decision has been made,
normalization is used to determine what the tables should look like and to
optimize your data model.

You should normalize after you have modeled your database.

Hunter
-- 
Hunter Zuker		Six Sigma CASE, Inc.	 13456 SE 27, Suite 210
zuker@6sigma.UUCP       (206) 643-6911           Bellevue, WA 98005-4211

cdm@gem-hy.Berkeley.EDU (Dale Cook) (11/27/90)

In article <33418@netnews.upenn.edu>, aaron@grad2.cis.upenn.edu (Aaron Watters) writes:
|> In article <1990Nov21.185907.10061@inel.gov> cdm@gem-hy.Berkeley.EDU (Dale Cook) writes:
|> >Programmer 
|> >convenience is not a legitimate tradeoff, or at least pretty damn low
|> >on the list.  Balance query time against update time.  Consider data
|> >volatility when considering storing summary information.  Try to look ahead in
|> >time.  Will the data be used by other applications?  Will it change?
|> >My point is that denormalization should not be undertaken lightly.  You
|> >should understand the consequences of it, both near and long term.
|> >The benefits to the entire system should be clearly identified.
|> >
|> >--- Dale Cook   cdm@inel.gov
|> 
|> On the contrary, programmer convenience is directly related
|> to the cost of maintaining and extending the database.  In
|> fact you could argue that the entire subject of database systems
|> is motivated by questions of programmer convenience.

The cost of maintaining and extending a database is far more closely
related to the design of the data than to programmer convenience.  True,
databases have many features which facilitate programmer productivity
(concurrency, locking, etc), but this does not mean that the database
is there solely for programmer convenience.

|>  (Why don't
|> we just write everything from scratch directly on top of the file
|> system, anyway?)  Further, as database systems get more elaborate

Better yet, why normalize at all?  Just make one big flat file... :-)

|> and technical the distinction between an end-user and a programmer
|> blurs rapidly.

Wrong.  The *need* for a programmer diminishes.  Programmers are 
seldom end users (with the exception of tools used in support of
programming), but more and more end users are acquiring programming
skills.  
|> 
|> I would draw the alternate conclusion that
|> database systems should provide facilities that reduce the 
|> hassles of working with a highly normalized databases -- such
|> as good support for complex views among other things.  It is
|> not clear this is always done.			-aaron

I used the term "programmer convenience" to mean just that - _programmer_
convenience, not end user convenience.  My earlier posting on this
subject stated that end user convenience _is_ a legitimate reason for
denormalization.  Perhaps more to the point: is the primary use of the
database geared towards ad hoc usage, or towards "batch" usage, where 
canned programs are the norm?  In an ad hoc environment, the programmer
tends to become part of an information center, where end-user requests
are satisfied.  In this mode, the programmer becomes an extension of
the end user, and convenience may become a legitimate concern.
In a "batch" environment, where you tend to have the
classical payroll, personnel, accounting, etc., systems, programmers
are creating less customized reports for a wide distribution.  It is
in this environment where programmer convenience is low on the list.

I would agree with your alternate conclusion.  But I stand by my
contention that programmer snivelling about complicated queries is 
not a good reason, in and of itself, to denormalize.  There are more
important factors.

--- Dale Cook    cdm@inel.gov

miket@blia.BLI.COM (Mike Tossy) (11/27/90)

In article <1990Nov20.190315.13505@informix.com>, randall@informix.com (Randall Rhea) writes:
> In article <1092@cortex.med.jhu.edu> jamesr@cortex.med.jhu.edu (Jim Rosenberg) writes:
>>Hello database gurus.
>>
>>I am building my first database, and am trying to understand how to balance
>>the benefits of normalization with performance requirements.  I am 
>>interested in any heuristics or literature on deciding when to scrap 
>>normalization to get a performance kick.  
  *
  *
  *
>>An example issue I am trying to answer:
>>  I have a domain that can have one of a given set of character string values.
>>  I am thinking of creating a relation that equates each string with
>>  an integer ID.  The benefits would be less memory used in the relation with
>>  that domain (since an integer would be stored for each tuple rather than
>>  an n byte length string), faster comparisons in a qualification (since 
>>  comparisons on that domain would be between integers and not strings), and
>>  less chance of typos in the database (since one could not enter a string that
>>  cannot be translated to an ID by a subquery into the ID relation).  The 
>>  cost, of course, is the subquery required whenever you want to access the
>>  domain.  Does the answer depend on the length of the strings, etc.?
> 
> A great example!  It shows how tricky the whole thing can be.  If there were
> only 10 possible values for your string, then I would certainly
> normalize and use the integer ID that you describe. 
  *
  *
  *
> If there were only 10 possible string values, I WOULD normalize because:
  *
  *
  *

Please, please, please!  Don't call this normalization!  You are encoding!
Encoding is a very useful database technique which has all the advantages
and disadvantages you discuss, but it is unrelated to the process of
normalization.  

Also, normalization does not always (or even normally) result in performance
degradation.  Frequently normalization can improve system performance.

--
Mike Tossy					ShareBase Corporation
miket@blia.bli.com				14600 Winchester Blvd
(408) 378-7575 ext2200				Los Gatos, CA 95030
	(ShareBase is a subsidiary of Teradata Corporation)

randall@informix.com (Randall Rhea) (11/27/90)

In article <1990Nov21.185907.10061@inel.gov> cdm@gem-hy.Berkeley.EDU (Dale Cook) writes:
>I think you are confusing design with implementation.  Database _designs_,
>IMHO, should always be normalized. 

What is "design" and what is "implementation"?  Where does one phase begin
and the other leave off?  I've never received a consistent answer to that
question from the various analysis methods I have come across.  Perhaps 
there is a reason for this; often, there is no clear dividing line between
them.   

>Remember that the goal of the 
>design is to understand your data and represent the data structure *as it
>logically exists* so that others can understand it as well

Absolutely correct!  What I mean by "over-normalization" is a database design
that does NOT represent the logical structure of the data: normalization 
that makes the data structure *more* difficult to understand. 

>|> 1. Any programs that need to look at the string would have to perform
>|> a sub-query.  In some report writer languages, this can be quite difficult.
>                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>So what?  That's why you get the big bucks, isn't it?

So what?  I'll tell you "so what" .... You won't get the job done, that's
"so what".  In the real world, we have to complete projects on time and
under budget.  I have spent countless hours tweaking report writers to
look at the zillions of tables that held the data I needed.  You may think
that "programmer inconvenience" is trivial, but programmer labor expenses 
are often far and away the most expensive piece of a project's budget. 

Maybe you guys who work for the government don't have to worry about
such things as "costs."  (heh heh heh, a tongue-in-cheek comment from
a frustrated taxpayer ...)

In the ideal world, where some database book authors live, we have perfect
computers, unlimited disk space, plenty of good programmers, unlimited time,
unlimited funds, and best of all ... a perfect RDBMS tool for the 
development of the system.  In the real world, we never have any of these
things.  I know that it's unfortunate when one must change the database
design (or "implementation") to make up for a bad programming tool. (or
bad programmers!) However, this is sometimes a painful reality.  Hey, you 
gotta get the $#@&# program written using the #&@!* lousy report writer.  
Sometimes, this means the database design must be simplified.  If it isn't,
you don't get the job done.  I know, you can always get another report
writer, provided you have the funds to do so.  If not, you're stuck.

My point is that your programming staff and your programming tools must
be taken into consideration when deciding how the database is to be
designed/implemented.  It's not as trivial as Dale thinks.

Overall, though, Dale makes a lot of valid points.  Database design/
implementation is not simple.  The postings on this net make this
painfully clear.

-- 

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Randall Rhea                                          Informix Software, Inc. 
Senior Programmer/Analyst, MIS                    uunet!pyramid!infmx!randall

cdm@gem-hy.Berkeley.EDU (Dale Cook) (11/28/90)

In article <1990Nov27.030450.6251@informix.com>, randall@informix.com (Randall Rhea) writes:
|> In article <1990Nov21.185907.10061@inel.gov> cdm@gem-hy.Berkeley.EDU (Dale Cook) writes:
|> >I think you are confusing design with implementation.  Database _designs_,
|> >IMHO, should always be normalized. 
|> 
|> What is "design" and what is "implementation"?  Where does one phase begin
|> and the other leave off?  I've never received a consistent answer to that
|> question from the various analysis methods I have come across.  Perhaps 
|> there is a reason for this; often, there is no clear dividing line between
|> them.
I wondered when someone would catch this.  The point I was trying to get
across is that you need to separate logical design considerations (e.g.,
normalization) from implementation considerations (e.g., denormalization).
Fully normalize first, THEN _consider_ denormalization.   
|> 
|> >Remember that the goal of the 
|> >design is to understand your data and represent the data structure *as it
|> >logically exists* so that others can understand it as well
|> 
|> Absolutely correct!  What I mean by "over-normalization" is a database design
|> that does NOT represent the logical structure of the data: normalization 
|> that makes the data structure *more* difficult to understand. 
|> 
I still say a normalized design DOES represent the logical structure of the
data.  If it makes the structure more difficult to understand, it is because
it is counter-intuitive to the "way it's always been done".  Programmers 
have historically taken a process view; it is process considerations which
generally force denormalization.
 
|> >|> 1. Any programs that need to look at the string would have to perform
|> >|> a sub-query.  In some report writer languages, this can be quite difficult.
|> >                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
|> >So what?  That's why you get the big bucks, isn't it?
|> 
|> So what?  I'll tell you "so what" .... You won't get the job done, that's
|> "so what".  In the real world, we have to complete projects on time and
|> under budget.  I have spent countless hours tweaking report writers to
|> look at the zillions of tables that held the data I needed.  You may think
|> that "programmer inconvenience" is trivial, but programmer labor expenses 
|> are often far and away the most expensive piece of a project's budget. 
|> 
To get projects done on time and under budget, you need proper estimation.
This means taking your environment into consideration.  If you have 
schedules imposed without your input, well, that's a management problem.
I absolutely agree that labor is the largest cost element in a project.
But don't forget, the largest cost element in the life cycle of a project
is maintenance.  In the narrow short term view, your concerns are valid.
In the long term, they are often proven trivial.  (BTW, my "so what" was
somewhat tongue-in-cheek as well, but I forgot the smiley :-( ).
|> 
|> In the ideal world, where some database book authors live, we have perfect
|> computers, unlimited disk space, plenty of good programmers, unlimited time,
|> unlimited funds, and best of all ... a perfect RDBMS tool for the 
|> development of the system.  In the real world, we never have any of these
|> things.  I know that it's unfortunate when one must change the database
|> design (or "implementation") to make up for a bad programming tool. (or
|> bad programmers!) However, this is sometimes a painful reality.  Hey, you 
|> gotta get the $#@&# program written using the #&@!* lousy report writer.  
|> Sometimes, this means the database design must be simplified.  If it isn't,
|> you don't get the job done.  I know, you can always get another report
|> writer- provided you have the funds to do so.  If not, you're stuck.
|> 
|> My point is that your programming staff and your programming tools must
|> be taken into consideration when deciding how the database is to be
|> designed/implemented.  It's not as trivial as Dale thinks.
|>
Absolutely.  I think we are disagreeing on what constitutes programmer
convenience, and the relative importance of same.  I don't define 
convenience to be language shortcomings.  This is a serious 
deficiency and must be considered.  I do, however, define convenience
to include such nonsense as "Just put the precipitation description in
the flight table.  You only have one at a time, right?  That way, I
don't have to access another table...".  Don't laugh- it happens.
I do not really mean to trivialize legitimate programmer concerns.
I've been on both sides of the fence.  I have equivalent problems 
with DBA's who "won't bastardize their pristine databases" under any
circumstances.  Neither of these views advance the company good.

|> Overall, though, Dale makes a lot of valid points.  Database design/
|> implementation is not simple.  The postings on this net make this
|> painfully clear.
|> 

All of Randall's points are valid.  The painful part is assigning
relative importance to ALL factors, over the life cycle of the project.
Programmers, given reasonable tools to work with, are the best equipped
to deal with complex queries.  Shouldn't their objections to the
processing complexities imposed by normalization be given less weight
than, say, the concern over the long term sharability of the data?

---Dale Cook     cdm@inel.gov

pavlov@canisius.UUCP (Greg Pavlov) (11/29/90)

In article <355@6sigma.UUCP>, zuker@6sigma.UUCP (Hunter Zuker) writes:
> Aaron Watters writes:
> 
>ALTERNATIVELY there is the phenomenon of the UNJUSTIFIED
>NORMALIZATION.  Where we make use of an assumption such as
>   Persons only have one address.
>   Husband's last name determines Wife's last name.
>   etcetera.  In this case the normalization may DISALLOW information
>   that a user may later want to insert into the database.  This
>   is another issue.			
> 
  These are "unjustified" only in the sense that there was no justification
  for the DBA/designer to use these assumptions in the first place.  The
  last time I had to set up a data base that pertained to family units, I
  could not make any assumptions about shared last names, addresses, telephones,
  etc.  E.g., I had to take into account instances such as a family with four
  children where only three of the members had the same last name (the wife and
  two of the children).  The Eisenhower era ended 30 years ago.

  greg pavlov, fstrf, amherst, ny
  pavlov@stewart.fstrf.org

cdm@gem-hy.Berkeley.EDU (Dale Cook) (11/30/90)

In article <2020@meaddata.meaddata.com>, gordon@meaddata.com (Gordon Edwards) writes:
|> 
|> IMHO, unjustified normalization refers to normalization for the sake of
|> normalization.  For example, getting a design in 3NF is adequate to account
|> for almost all anomolies.  Knowing that further normalization will lead to 
|> more joins to get any useful work done,  I usually stop at 3NF.  Remember

If what you mean by "normalization for the sake of normalization" is 
"normalization beyond the point of diminishing return", I'll buy that.
But you can do that in going from 2NF to 3NF as well.  It all depends
on understanding your application.  Ensuring 4 or 5NF will probably
only pay off for large, long-lived and/or extremely complex databases.

|> economies of scale, taking a design from 3NF to BCNF will probably take more
|> work than getting the design in 3NF to start with and will almost certainly
|> incur a greater performance penalty.
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is not generally true.  4NF is usually pretty subtle, and involves
improper attribution.  The first half of your sentence really says it.  
It is often not worth the effort to _ensure_ 4NF.  
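
For what it's worth, here is a textbook-style sketch of the kind of 4NF
problem I mean (generic SQL, invented names): if an employee's skills and
project assignments are independent of one another, keeping them in one
table forces every combination to be stored.

    -- Violates 4NF: skill and project are independent multivalued facts about
    -- emp, so every (skill, project) pairing must appear as its own row.
    CREATE TABLE emp_skill_project
    (
        emp     VARCHAR(20) NOT NULL,
        skill   VARCHAR(20) NOT NULL,
        project VARCHAR(20) NOT NULL,
        PRIMARY KEY (emp, skill, project)
    );

    -- 4NF decomposition: one independent fact per table.
    CREATE TABLE emp_skill
    (
        emp   VARCHAR(20) NOT NULL,
        skill VARCHAR(20) NOT NULL,
        PRIMARY KEY (emp, skill)
    );

    CREATE TABLE emp_project
    (
        emp     VARCHAR(20) NOT NULL,
        project VARCHAR(20) NOT NULL,
        PRIMARY KEY (emp, project)
    );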
|> 
|> In fact, most of the design courses that were developed by DBMS vendors (at
|> least that I have seen) only teach normalization to 3NF.
|> 

This is because 3NF *USUALLY* implies 4NF & 5NF.

Remember:  We normalize to maximize integrity and stability, and minimize
data redundancy.  We denormalize _with caution_ for environmental reasons,
which should be justified through weighing the alternatives and knowing
the possible impacts.

--- Dale Cook   cdm@inel.gov 

tim@ohday.sybase.com (Tim Wood) (12/01/90)

In article <33418@netnews.upenn.edu> aaron@grad1.cis.upenn.edu (Aaron Watters) writes:
>In article <1990Nov21.185907.10061@inel.gov> cdm@gem-hy.Berkeley.EDU (Dale Cook) writes:
>
>On the contrary, programmer convenience is directly related
>to the cost of maintaining and extending the database.  In
>fact you could argue that the entire subject of database systems
>is motivated by questions of programmer convenience.  
>(Why don't
>we just write everything from scratch directly on top of the file
>system, anyway?)  

I don't agree.  For large multi-user databases, the emphasis should be
on making life easy for the end-users at the remote terminals spread 
throughout the service area, e.g. bank tellers.  Making the teller's 
job of using the computer easy means that transactions will be generated
faster during busy periods and the business will be more productive.  

I'd say programmer convenience is a subordinate priority to end-user
convenience, because there are many more users than programmers of
large databases.  Also, programmer convenience will be determined more
by the power of the application development tools than by database
features.  After all, 4GL applications could manage that flat-file 
database, though data integrity would be very exposed; users could
still be spared the resulting ugliness :-).

Database features should be more oriented toward preserving data integrity.
Of course, there are easy and hard ways to do that.  For a programmer,
preserving referential integrity via a trigger is the hard way, via a
declaration is the easy way.  However, the choice could be hidden in an
application tool which either generated a template referential
integrity trigger and substituted specific object names, or simply
generated the referential declaration.  I feel the declarative approach
is better, because it is simpler at any level.  This makes it the
product vendor's responsibility to handle the complexity of building
declarative r.i. into the product internals.  Complexity doesn't go
away, it is just moved around.  The idea is to leverage it.
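
For instance (generic SQL, invented names; trigger syntax varies a great deal
from product to product, so treat the second half as a sketch only):

    CREATE TABLE customer
    (
        customer_id INTEGER     NOT NULL PRIMARY KEY,
        name        VARCHAR(40) NOT NULL
    );

    -- Declarative referential integrity: one clause, enforced by the DBMS.
    CREATE TABLE account
    (
        acct_no     INTEGER NOT NULL PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer
    );

    -- The trigger route: the same rule spelled out by hand.
    CREATE TRIGGER account_has_customer
    BEFORE INSERT ON account
    FOR EACH ROW
    WHEN NOT EXISTS (SELECT 1 FROM customer c
                     WHERE c.customer_id = NEW.customer_id)
    BEGIN
        -- how a violation is rejected is product-specific
        SELECT RAISE(ABORT, 'account.customer_id has no matching customer');
    END;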

>Further, as database systems get more elaborate
>and technical the distinction between an end-user and a programmer
>blurs rapidly.

Not if it's designed properly.  As a system gets more features, it's 
going to be harder to use at some level because of the need to select
among the features.  But the system should allow hiding of that complexity
at appropriate levels so that one only need know a few simple things
to do something simple.  The larger the learning investment a user needs
to make in a system before doing the first unit of useful work, the 
less optimal is that system's design.

>I would draw the alternate conclusion that
>database systems should provide facilities that reduce the 
>hassles of working with a highly normalized databases -- such
>as good support for complex views among other things.  It is
>not clear this is always done.			-aaron

Agreed (funny, since I disagreed with your premises :-).  
That's why adding complexity is costly--to do it right, you have
to concurrently create schemes for hiding it.  
-TW

Sybase, Inc. / 6475 Christie Ave. / Emeryville, CA / 94608	  415-596-3500
tim@sybase.com          {pacbell,pyramid,sun,{uunet,ucbvax}!mtxinu}!sybase!tim
Dis claim er dat claim, what's da difference?  I'm da one doin da tawkin' hea.