jamesr@cortex.med.jhu.edu (Jim Rosenberg) (11/20/90)
Hello database gurus.

I am building my first database, and am trying to understand how to balance the benefits of normalization with performance requirements. I am interested in any heuristics or literature on deciding when to scrap normalization to get a performance kick. The database will reside on a DECsystem 5400, and will be built using Ingres (whichever version DEC is supplying run-time libraries for in Ultrix 4.x). We have ample memory on the system for a large database, so wasting some memory with empty domains, for example, really isn't a problem. I am able to create the database in what I understand to be fourth normal form, and I am familiar with the implementation of Ingres (University Ingres at least). "Rules" regarding trade-offs between relation size and performance, number of relations and performance, subqueries vs. searches through large relations, etc., would be a great help.

An example issue I am trying to answer: I have a domain that can have one of a given set of character string values. I am thinking of creating a relation that equates each string with an integer ID. The benefits would be less memory used in the relation with that domain (since an integer would be stored for each tuple rather than an n-byte string), faster comparisons in a qualification (since comparisons on that domain would be between integers and not strings), and less chance of typos in the database (since one could not enter a string that cannot be translated to an ID by a subquery into the ID relation). The cost, of course, is the subquery required whenever you want to access the domain. Does the answer depend on the length of the strings, etc.?

Thanks in advance for all the help. Replies by email or to this newsgroup are just fine.

James Rosenberg (jamesr@cortex.med.jhu.edu)
Div. of Psych. Neuro-Imaging, Johns Hopkins University
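[Ed. note: the string-to-ID scheme Jim describes can be sketched in modern SQL terms. This uses Python's sqlite3 rather than Ingres/QUEL, and the table and column names are invented for illustration; it is a sketch of the idea, not Jim's actual schema.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The "ID relation": one row per legal string value.
cur.execute("CREATE TABLE diagnosis_codes (id INTEGER PRIMARY KEY, label TEXT UNIQUE NOT NULL)")
cur.executemany("INSERT INTO diagnosis_codes (label) VALUES (?)",
                [("normal",), ("lesion",), ("artifact",)])

# The main relation stores the compact integer, not the n-byte string.
cur.execute("""CREATE TABLE scans (
    scan_id INTEGER PRIMARY KEY,
    diagnosis_id INTEGER NOT NULL REFERENCES diagnosis_codes(id))""")

# A typo'd string has no ID, so it cannot be translated for storage.
row = cur.execute("SELECT id FROM diagnosis_codes WHERE label = ?", ("leison",)).fetchone()
assert row is None

# A legal value translates to its ID via a subquery on insert...
cur.execute("""INSERT INTO scans (diagnosis_id)
               SELECT id FROM diagnosis_codes WHERE label = ?""", ("lesion",))

# ...and back to a string via a join on retrieval -- the recurring cost Jim mentions.
label = cur.execute("""SELECT d.label FROM scans s
                       JOIN diagnosis_codes d ON d.id = s.diagnosis_id""").fetchone()[0]
print(label)  # lesion
```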
randall@informix.com (Randall Rhea) (11/21/90)
In article <1092@cortex.med.jhu.edu> jamesr@cortex.med.jhu.edu (Jim Rosenberg) writes:
>Hello database gurus.
>
>I am building my first database, and am trying to understand how to balance
>the benefits of normalization with performance requirements. I am
>interested in any heuristics or literature on deciding when to scrap
>normalization to get a performance kick.

This may be the biggest question in database design. There is no simple answer for it. Whether or not to normalize is a question that can only be answered on a case-by-case basis.

In general, I have had difficulty with over-normalized database designs. There are analysts who have read a lot of database books, but have little experience in actually developing real applications ... these are the folks that tend to over-normalize. I find myself needing to write a fairly simple report, but it turns into a complicated program because data is all over the place. As you correctly point out, one gets tired of writing multiple sub-queries.

I could go on and on about database design theory, but you could find one of those books that discuss it better than I could. I will, however, refer to your specific example:

>An example issue I am trying to answer: I have a domain that can have one
>of a given set of character string values. I am thinking of creating a
>relation that equates each string with an integer ID. The benefits would be
>less memory used in the relation with that domain, faster comparisons in a
>qualification, and less chance of typos in the database. The cost, of
>course, is the subquery required whenever you want to access the domain.
>Does the answer depend on the length of the strings, etc.?

A great example! It shows how tricky the whole thing can be. If there were only 10 possible values for your string, then I would certainly normalize and use the integer ID that you describe. However, if there were 500 possible string values, or 5000, then I may just stick the string in the same table as everything else and forgo the integer ID. If there were 5000 possible values for your string, then it may not be worth the extra table because:

1. Any programs that need to look at the string would have to perform a sub-query. In some report-writer languages, this can be quite difficult.

2. You could increase the CPU and I/O resources required by the application, which means your application could perform more slowly.

3. Your string table would have to be maintained by somebody. Keeping track of 5000 different string values is no trivial task.

If there were only 10 possible string values, I WOULD normalize because:

1. It could cut down on disk space requirements.

2. The string reference table would enforce standards for the strings, e.g. spellings.

3. The system would be easier to use, since a data entry person would not be required to type in a string; he could simply choose one from an approved list.

In database design, as in life, there are tradeoffs.

-- 
Randall Rhea                                      Informix Software, Inc.
Senior Programmer/Analyst, MIS                    uunet!pyramid!infmx!randall
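[Ed. note: the disk-space side of this tradeoff is easy to estimate. The figures below — a 20-byte string column, a 4-byte integer, a million tuples, a 5000-row lookup table — are invented for illustration, not taken from either poster's system.]

```python
n_tuples = 1_000_000
string_bytes, int_bytes = 20, 4
n_distinct = 5000

# Denormalized: the string is repeated in every tuple.
inline_cost = n_tuples * string_bytes

# Encoded: each tuple carries only an int, plus one lookup table
# holding each distinct (string, id) pair exactly once.
encoded_cost = n_tuples * int_bytes + n_distinct * (string_bytes + int_bytes)

saved = inline_cost - encoded_cost
print(saved)  # 15880000 bytes, i.e. roughly 15 MB saved
```

Even with 5000 distinct values, the lookup table itself is tiny next to the per-tuple savings; the real costs Randall lists are the sub-queries and the maintenance, not the disk.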
cdm@gem-hy.Berkeley.EDU (Dale Cook) (11/22/90)
In article <1990Nov20.190315.13505@informix.com>, randall@informix.com (Randall Rhea) writes:
|> In article <1092@cortex.med.jhu.edu> jamesr@cortex.med.jhu.edu (Jim Rosenberg) writes:
|> >I am building my first database, and am trying to understand how to balance
|> >the benefits of normalization with performance requirements. I am
|> >interested in any heuristics or literature on deciding when to scrap
|> >normalization to get a performance kick.
|>
|> This may be the biggest question in database design. There is no simple
|> answer for it. Whether or not to normalize is a question that can
|> only be answered on a case-by-case basis.
|>
|> In general, I have had difficulty with over-normalized database designs.
|> There are analysts who have read a lot of database books, but have little
|> experience in actually developing real applications ... these are the folks
|> that tend to over-normalize. I find myself needing to write a fairly simple
|> report, but it turns into a complicated program because data is all over
|> the place. As you correctly point out, one gets tired of writing multiple
|> sub-queries.

I think you are confusing design with implementation. Database _designs_, IMHO, should always be normalized. You then examine the application requirements to determine if it is best to denormalize specific aspects of your design during implementation. Typically, these are _performance_ requirements, *not* programmer convenience. (However, end-user convenience is a legitimate requirement.)

Remember that the goal of the design is to understand your data and represent the data structure *as it logically exists*, so that others can understand it as well, along with the obvious goals of minimizing the impact of future changes and maximizing data sharability. While I sympathize with the problems of "data all over the place", that does not in and of itself justify denormalization (especially for programmers' sake).
I've seen many a database where exactly that has been done, which has made changing that database a maintenance nightmare.

|> I could go on and on about database design theory, but you could find one
|> of those books that discuss it better than I could. I will, however,
|> refer to your specific example:
|>
|> >An example issue I am trying to answer: I have a domain that can have one
|> >of a given set of character string values. I am thinking of creating a
|> >relation that equates each string with an integer ID. ...
|>
|> A great example! It shows how tricky the whole thing can be. If there were
|> only 10 possible values for your string, then I would certainly
|> normalize and use the integer ID that you describe. However, if there were
|> 500 possible string values, or 5000, then I may just stick the string in
|> the same table as everything else and forgo the integer ID. If there
|> were 5000 possible values for your string, then it may not be worth the
|> extra table because:
|>
|> 1. Any programs that need to look at the string would have to perform
|>    a sub-query. In some report writer languages, this can be quite difficult.
                                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
So what? That's why you get the big bucks, isn't it?

|> 2. You could increase the CPU and I/O resources required by
|>    the application. This means that your application could perform more slowly.
What type of application? Queries may be slower, but updates may be significantly faster! You have to consider the whole system, not the piece of it you're working on today. If your data is static, it may be worth it. If it's volatile, you'll pay dearly for denormalization.

|> 3. Your string table would have to be maintained by somebody. Keeping
|>    track of 5000 different string values is no trivial task.

No kidding. But if you embed these strings in one table, what happens to the poor slob who needs those strings in another table??? Is (s)he supposed to replicate this mess in the new table? So much for disk savings. And suppose someone changes the value of one of these strings, but neglects to change all of the other like values? Your data is now inconsistent. Your boss gets one value on his report, and his/her boss gets another on theirs. Your boss will want to know why (s)he looked like an a-hole in the meeting this morning.

|> If there were only 10 possible string values, I WOULD normalize because:
|>
|> 1. It could cut down on disk space requirements.

How do we know this? The number of distinct values is of less importance than other factors, such as the number of tables related to this one and data volatility.

|> 2. The string reference table would enforce standards for the strings, e.g.
|>    spellings.

Why would one care about spellings if one had 10 possible values, but not care if there were 5000?

|> 3. The system would be easier to use, since a data entry person would not be
|>    required to type in a string; he could simply choose one from an approved list.
|>
|> In database design, as in life, there are tradeoffs.

There are no tradeoffs in design, only implementation. Programmer convenience is not a legitimate tradeoff, or at least is pretty damn low on the list. Balance query time against update time. Consider data volatility when considering storing summary information. Try to look ahead in time. Will the data be used by other applications?
Will it change?

My point is that denormalization should not be undertaken lightly. You should understand the consequences of it, both near and long term. The benefits to the entire system should be clearly identified.

--- Dale Cook   cdm@inel.gov

========== long legal disclaimer follows, press n to skip ===========
Neither the United States Government or the Idaho National Engineering Laboratory or any of their employees, makes any warranty, whatsoever, implied, or assumes any legal liability or responsibility regarding any information, disclosed, or represents that its use would not infringe privately owned rights. No specific reference constitutes or implies endorsement, recommendation, or favoring by the United States Government or the Idaho National Engineering Laboratory. The views and opinions expressed herein do not necessarily reflect those of the United States Government or the Idaho National Engineering Laboratory, and shall not be used for advertising or product endorsement purposes.
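[Ed. note: Dale's inconsistency scenario is easy to reproduce. A sketch in sqlite3 via Python, with invented table and value names: the same descriptive string is embedded in two tables, someone renames it in one, and the two "reports" now disagree.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalized: the same descriptive string is duplicated in two tables.
cur.execute("CREATE TABLE orders (id INTEGER, region TEXT)")
cur.execute("CREATE TABLE shipments (id INTEGER, region TEXT)")
cur.execute("INSERT INTO orders VALUES (1, 'North-West')")
cur.execute("INSERT INTO shipments VALUES (1, 'North-West')")

# Someone changes the value in one table but neglects the other like values.
cur.execute("UPDATE orders SET region = 'Northwest' WHERE region = 'North-West'")

a = cur.execute("SELECT region FROM orders").fetchone()[0]
b = cur.execute("SELECT region FROM shipments").fetchone()[0]
print(a, b)  # Northwest North-West -- one value on your boss's report, another on his/her boss's
```

With a single reference table holding the string once, the UPDATE would have had exactly one row to touch and the anomaly could not arise.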
cdm@gem-hy.Berkeley.EDU (Dale Cook) (11/22/90)
In article <33265@netnews.upenn.edu>, aaron@grad2.cis.upenn.edu (Aaron Watters) writes:
|> ALTERNATIVELY there is the phenomenon of the UNJUSTIFIED
|> NORMALIZATION, where we make use of an assumption such as
|>     Persons only have one address.
|>     Husband's last name determines Wife's last name.
|> etcetera. In this case the normalization may DISALLOW information
|> that a user may later want to insert into the database. This
|> is another issue.   -aaron.

I don't understand what you mean by "unjustified normalization". In fact, I don't think there's any such thing. One normalizes the design, and denormalizes the implementation based on _rational_ decisions about performance, stability, etc. Your examples above are examples of incorrect data analysis, and have nothing to do with normalization.

--- Dale Cook   cdm@inel.gov
zuker@6sigma.UUCP (Hunter Zuker) (11/22/90)
>(Randall Rhea) writes:
>>(Jim Rosenberg) writes:
>>I am building my first database, and am trying to understand how to balance
>>the benefits of normalization with performance requirements.
>
>This may be the biggest question in database design. There is no simple
>answer for it. Whether or not to normalize is a question that can
>only be answered on a case-by-case basis.

Agreed. When and where to de-normalize does not always have a clear answer.

The first step in designing a good database is to model the database. You can do this with something like Entity-Relationship diagrams, or, as I prefer, the Extended Relational Analysis model. From there you can normalize. (I like third normal form; fourth normal form is generally considered overkill.) Once you have a normalized model, then is the time to consider what might be intelligently de-normalized.

Without normalization, the possibility for data loss or data conflicts increases dramatically. Other reasons to normalize are to save disk space and increase *overall* performance. (Integrity checks on un-normalized databases can really slow things down.) Of course, if you normalize, it also increases the likelihood that you've correctly thought through your database, which can only help to make a better database.

>In general, I have had difficulty with over-normalized database designs.

Agreed. But I would rather have an over-normalized database than one that wasn't normalized, or one that was not well de-normalized.

>>An example issue I am trying to answer: I am thinking of creating a
>>relation that equates each string with an integer ID.
>
>A great example! It shows how tricky the whole thing can be. If there were
>only 10 possible values for your string, then I would certainly
>normalize and use the integer ID that you describe. However, if there were
>500 possible string values, or 5000, then I may just stick the string in
>the same table as everything else and forgo the integer ID.
>If there were 5000 possible values for your string, then it may not be
>worth the extra table because:
>
>1. Any programs that need to look at the string would have to perform
>a sub-query. In some report writer languages, this can be quite difficult.

The first sentence is the logical consequence of creating a string table. The second sentence is a potential limitation of the DBMS you are using. Remember, not all databases that call themselves relational are relational.

>2. You could increase the number of CPU and I/O resources required by
>the application. This means that your application could perform slower.

The time it takes to do the sub-query needs to be weighed against any integrity checks that are needed to validate an entry.

>3. Your string table would have to be maintained by somebody. Keeping
>track of 5000 different string values is no trivial task.

True. It depends on how static the string table is. The more dynamic the table, the less desirable it is. Someone has to enter the table initially in either case.

>If there were only 10 possible string values, I WOULD normalize because:
>
>1. It could cut down on disk space requirements.

This is true in either case. If you have a couple of million records, then a string table of 5000 might not be unreasonable. It could save you a few MB or so. If you only have 100, then a string table of 5000 would be silly.

>2. The string reference table would enforce standards for the strings, e.g.
>spellings.
>3. The system would be easier to use, since a data entry person would not be
>required to type in a string; he could simply choose one from an approved list.

Both 2 and 3 are true regardless of the size of the string table.

Whether to include a zipcode table is equivalent to the above example. A zipcode uniquely identifies a city and state. If you have a million or so addresses in your database, it might make sense to have a zipcode table.
If you only have a thousand (like we do), then it is probably better to include city, state and zip in the address record.

>In database design, as in life, there are tradeoffs.

True.

Hunter Zuker
-- 
Hunter Zuker               Six Sigma CASE, Inc.    13456 SE 27, Suite 210
zuker@6sigma.UUCP          (206) 643-6911          Bellevue, WA 98005-4211
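[Ed. note: Hunter's zipcode example, sketched the same way — assuming, as he does, that a zipcode determines city and state. sqlite3 stands in for the DBMS; names and values are invented.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized: each (city, state) pair is stored once, keyed by zip.
cur.execute("CREATE TABLE zipcodes (zip TEXT PRIMARY KEY, city TEXT, state TEXT)")
cur.execute("INSERT INTO zipcodes VALUES ('98005', 'Bellevue', 'WA')")

# Address records carry only the zip...
cur.execute("CREATE TABLE addresses (name TEXT, street TEXT, zip TEXT REFERENCES zipcodes(zip))")
cur.execute("INSERT INTO addresses VALUES ('H. Zuker', '13456 SE 27, Suite 210', '98005')")

# ...and city/state come back via a join -- the sub-query cost being debated.
city, state = cur.execute("""SELECT z.city, z.state FROM addresses a
                             JOIN zipcodes z ON z.zip = a.zip""").fetchone()
print(city, state)  # Bellevue WA
```

With only a thousand addresses, the join buys little; with a million, the once-per-zip storage of city and state starts to pay for itself.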
zuker@6sigma.UUCP (Hunter Zuker) (11/27/90)
Aaron Watters writes:
>ALTERNATIVELY there is the phenomenon of the UNJUSTIFIED
>NORMALIZATION, where we make use of an assumption such as
>    Persons only have one address.
>    Husband's last name determines Wife's last name.
>etcetera. In this case the normalization may DISALLOW information
>that a user may later want to insert into the database. This
>is another issue.

I would argue that these aren't "UNJUSTIFIED NORMALIZATIONS". I would say that the data was not modeled correctly. There was probably a time when it would have been an accurate model to say that a wife's last name is an attribute of the husband. Of course, now this is not accurate. Exactly how this should be modeled will depend on what you are building the database for. In many databases it is sufficient for the data model to allow a person only one address. In others it may not be sufficient. Once this decision has been made, then normalization is used to determine what the tables look like to optimize your data model. You should normalize after you have modeled your database.

Hunter
-- 
Hunter Zuker               Six Sigma CASE, Inc.    13456 SE 27, Suite 210
zuker@6sigma.UUCP          (206) 643-6911          Bellevue, WA 98005-4211
cdm@gem-hy.Berkeley.EDU (Dale Cook) (11/27/90)
In article <33418@netnews.upenn.edu>, aaron@grad2.cis.upenn.edu (Aaron Watters) writes:
|> In article <1990Nov21.185907.10061@inel.gov> cdm@gem-hy.Berkeley.EDU (Dale Cook) writes:
|> >Programmer convenience is not a legitimate tradeoff, or at least pretty damn
|> >low on the list. Balance query time against update time. Consider data
|> >volatility when considering storing summary information. Try to look ahead
|> >in time. Will the data be used by other applications? Will it change?
|> >My point is that denormalization should not be undertaken lightly. You
|> >should understand the consequences of it, both near and long term.
|> >The benefits to the entire system should be clearly identified.
|>
|> On the contrary, programmer convenience is directly related
|> to the cost of maintaining and extending the database. In
|> fact you could argue that the entire subject of database systems
|> is motivated by questions of programmer convenience.

The cost of maintaining and extending a database is far more closely related to the design of the data than to programmer convenience. True, databases have many features which facilitate programmer productivity (concurrency, locking, etc.), but this does not mean that the database is there solely for programmer convenience.

|> (Why don't we just write everything from scratch directly on top of the
|> file system, anyway?)

Better yet, why normalize at all? Just make one big flat file... :-)

|> Further, as database systems get more elaborate and technical, the
|> distinction between an end-user and a programmer blurs rapidly.

Wrong. The *need* for a programmer diminishes. Programmers are seldom end users (with the exception of tools used in support of programming), but more and more end users are acquiring programming skills.
|> I would draw the alternate conclusion that
|> database systems should provide facilities that reduce the
|> hassles of working with highly normalized databases -- such
|> as good support for complex views, among other things. It is
|> not clear this is always done.   -aaron

I used the term "programmer convenience" to mean just that: _programmer_ convenience, not end user convenience. My earlier posting on this subject stated that end user convenience _is_ a legitimate reason for denormalization.

Perhaps more to the point: is the primary use of the database geared towards ad hoc usage, or towards "batch" usage, where canned programs are the norm? In an ad hoc environment, the programmer tends to become part of an information center, where end-user requests are satisfied. In this mode, the programmer becomes an extension of the end user, and convenience may become a legitimate concern. In a "batch" environment, where you tend to have the classical payroll, personnel, accounting, etc., systems, programmers are creating less customized reports for a wide distribution. It is in this environment that programmer convenience is low on the list.

I would agree with your alternate conclusion. But I stand by my contention that programmer snivelling about complicated queries is not a good reason, in and of itself, to denormalize. There are more important factors.

--- Dale Cook   cdm@inel.gov
miket@blia.BLI.COM (Mike Tossy) (11/27/90)
In article <1990Nov20.190315.13505@informix.com>, randall@informix.com (Randall Rhea) writes:
> In article <1092@cortex.med.jhu.edu> jamesr@cortex.med.jhu.edu (Jim Rosenberg) writes:
>>I am building my first database, and am trying to understand how to balance
>>the benefits of normalization with performance requirements. I am
>>interested in any heuristics or literature on deciding when to scrap
>>normalization to get a performance kick.
   * * *
>>An example issue I am trying to answer: I have a domain that can have one
>>of a given set of character string values. I am thinking of creating a
>>relation that equates each string with an integer ID. ...
>
> A great example! It shows how tricky the whole thing can be. If there were
> only 10 possible values for your string, then I would certainly
> normalize and use the integer ID that you describe.
   * * *
> If there were only 10 possible string values, I WOULD normalize because:
   * * *

Please, please, please! Don't call this normalization! You are encoding! Encoding is a very useful database technique which has all the advantages and disadvantages you discuss, but it is unrelated to the process of normalization. Also, normalization does not always (or even normally) result in performance degradation. Frequently, normalization can improve system performance.
-- 
Mike Tossy                        ShareBase Corporation
miket@blia.bli.com                14600 Winchester Blvd
(408) 378-7575 ext 2200           Los Gatos, CA 95030
          (ShareBase is a subsidiary of Teradata Corporation)
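[Ed. note: Mike's distinction can be made concrete. Normalization removes a functional dependency by decomposing a relation; the key can remain a string throughout. Encoding merely substitutes a compact surrogate for a value, and is orthogonal to that. A sketch with invented names:]

```python
# Normalization: diagnosis -> description is a functional dependency, so the
# description moves to its own relation. No integer IDs involved; the key is
# still the string itself.
scans = [("s1", "lesion"), ("s2", "lesion"), ("s3", "normal")]
diagnoses = {"lesion": "focal abnormality", "normal": "no finding"}

# Encoding: a purely physical substitution of int for string, applicable
# whether or not the schema is normalized.
codes = {"lesion": 1, "normal": 2}
encoded_scans = [(sid, codes[d]) for sid, d in scans]

# Encoding is lossless: decoding recovers the original relation exactly.
decode = {v: k for k, v in codes.items()}
assert [(sid, decode[c]) for sid, c in encoded_scans] == scans

print(encoded_scans)  # [('s1', 1), ('s2', 1), ('s3', 2)]
```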
randall@informix.com (Randall Rhea) (11/27/90)
In article <1990Nov21.185907.10061@inel.gov> cdm@gem-hy.Berkeley.EDU (Dale Cook) writes:
>I think you are confusing design with implementation. Database _designs_,
>IMHO, should always be normalized.

What is "design" and what is "implementation"? Where does one phase begin and the other leave off? I've never received a consistent answer to that question from the various analysis methods I have come across. Perhaps there is a reason for this: often, there is no clear dividing line between them.

>Remember that the goal of the
>design is to understand your data and represent the data structure *as it
>logically exists* so that others can understand it as well

Absolutely correct! What I mean by "over-normalization" is a database design that does NOT represent the logical structure of the data: normalization that makes the data structure *more* difficult to understand.

>|> 1. Any programs that need to look at the string would have to perform
>|> a sub-query. In some report writer languages, this can be quite difficult.
>So what? That's why you get the big bucks, isn't it?

So what? I'll tell you "so what" ... you won't get the job done, that's "so what". In the real world, we have to complete projects on time and under budget. I have spent countless hours tweaking report writers to look at the zillions of tables that held the data I needed. You may think that "programmer inconvenience" is trivial, but programmer labor expenses are often far and away the most expensive piece of a project's budget. Maybe you guys who work for the government don't have to worry about such things as "costs". (heh heh heh, a tongue-in-cheek comment from a frustrated taxpayer ...)

In the ideal world, where some database book authors live, we have perfect computers, unlimited disk space, plenty of good programmers, unlimited time, unlimited funds, and best of all ... a perfect RDBMS tool for the development of the system.
In the real world, we never have any of these things. I know that it's unfortunate when one must change the database design (or "implementation") to make up for a bad programming tool (or bad programmers!). However, this is sometimes a painful reality. Hey, you gotta get the $#@&# program written using the #&@!* lousy report writer. Sometimes, this means the database design must be simplified. If it isn't, you don't get the job done. I know, you can always get another report writer, provided you have the funds to do so. If not, you're stuck.

My point is that your programming staff and your programming tools must be taken into consideration when deciding how the database is to be designed/implemented. It's not as trivial as Dale thinks.

Overall, though, Dale makes a lot of valid points. Database design/implementation is not simple. The postings on this net make this painfully clear.

-- 
Randall Rhea                                      Informix Software, Inc.
Senior Programmer/Analyst, MIS                    uunet!pyramid!infmx!randall
cdm@gem-hy.Berkeley.EDU (Dale Cook) (11/28/90)
In article <1990Nov27.030450.6251@informix.com>, randall@informix.com (Randall Rhea) writes:
|> What is "design" and what is "implementation"? Where does one phase begin
|> and the other leave off? I've never received a consistent answer to that
|> question from the various analysis methods I have come across. Perhaps
|> there is a reason for this; often, there is no clear dividing line between
|> them.

I wondered when someone would catch this. The point I was trying to get across is that you need to separate logical design considerations (e.g., normalization) from implementation considerations (e.g., denormalization). Fully normalize first, THEN _consider_ denormalization.

|> Absolutely correct! What I mean by "over-normalization" is a database design
|> that does NOT represent the logical structure of the data: normalization
|> that makes the data structure *more* difficult to understand.

I still say a normalized design DOES represent the logical structure of the data. If it makes the structure more difficult to understand, it is because it is counter-intuitive to the "way it's always been done". Programmers have historically taken a process view; it is process considerations which generally force denormalization.

|> So what? I'll tell you "so what" ... you won't get the job done, that's
|> "so what".
|> In the real world, we have to complete projects on time and
|> under budget. I have spent countless hours tweaking report writers to
|> look at the zillions of tables that held the data I needed. You may think
|> that "programmer inconvenience" is trivial, but programmer labor expenses
|> are often far and away the most expensive piece of a project's budget.

To get projects done on time and under budget, you need proper estimation. This means taking your environment into consideration. If you have schedules imposed without your input, well, that's a management problem. I absolutely agree that labor is the largest cost element in a project. But don't forget, the largest cost element in the life cycle of a project is maintenance. In the narrow short-term view, your concerns are valid. In the long term, they are often proven trivial. (BTW, my "so what" was somewhat tongue-in-cheek as well, but I forgot the smiley :-( ).

|> In the ideal world, where some database book authors live, we have perfect
|> computers, unlimited disk space, plenty of good programmers, unlimited time,
|> unlimited funds, and best of all ... a perfect RDBMS tool for the
|> development of the system. In the real world, we never have any of these
|> things. ... Sometimes, this means the database design must be simplified.
|> If it isn't, you don't get the job done.
|>
|> My point is that your programming staff and your programming tools must
|> be taken into consideration when deciding how the database is to be
|> designed/implemented. It's not as trivial as Dale thinks.

Absolutely.
I think we are disagreeing on what constitutes programmer convenience, and on
the relative importance of same.  I don't define convenience to be language
shortcomings.  That is a serious deficiency and must be considered.  I do,
however, define convenience to include such nonsense as "Just put the
precipitation description in the flight table.  You only have one at a time,
right?  That way, I don't have to access another table...".  Don't laugh --
it happens.

I do not really mean to trivialize legitimate programmer concerns.  I've been
on both sides of the fence.  I have equivalent problems with DBAs who "won't
bastardize their pristine databases" under any circumstances.  Neither of
these views advances the good of the company.

|> Overall, though, Dale makes a lot of valid points.  Database design/
|> implementation is not simple.  The postings on this net make this
|> painfully clear.

All of Randall's points are valid.  The painful part is assigning relative
importance to ALL factors, over the life cycle of the project.  Programmers,
given reasonable tools to work with, are the best equipped to deal with
complex queries.  Shouldn't their objections to the processing complexities
imposed by normalization be given less weight than, say, the concern over the
long-term sharability of the data?

---Dale Cook    cdm@inel.gov

========== long legal disclaimer follows, press n to skip ===========

Neither the United States Government or the Idaho National Engineering
Laboratory or any of their employees, makes any warranty, whatsoever,
implied, or assumes any legal liability or responsibility regarding any
information, disclosed, or represents that its use would not infringe
privately owned rights.  No specific reference constitutes or implies
endorsement, recommendation, or favoring by the United States Government
or the Idaho National Engineering Laboratory.
The views and opinions expressed herein do not necessarily reflect those of the United States Government or the Idaho National Engineering Laboratory, and shall not be used for advertising or product endorsement purposes.
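[The string-for-integer lookup relation that started this thread, and the join "tax" Randall objects to, can be sketched in a few lines.  This is an illustrative toy, not the original schema: the table names `diagnosis` and `scan` are invented, and Python's built-in SQLite stands in for Ingres.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized form: the repeated string lives once in a lookup relation,
# and each tuple in the big relation stores only a small integer ID.
cur.executescript("""
CREATE TABLE diagnosis (
    diag_id   INTEGER PRIMARY KEY,
    diag_name TEXT UNIQUE NOT NULL
);
CREATE TABLE scan (
    scan_id INTEGER PRIMARY KEY,
    diag_id INTEGER NOT NULL REFERENCES diagnosis(diag_id)
);
""")
cur.executemany("INSERT INTO diagnosis VALUES (?, ?)",
                [(1, "normal"), (2, "lesion"), (3, "artifact")])
cur.executemany("INSERT INTO scan VALUES (?, ?)",
                [(10, 2), (11, 1), (12, 2)])

# The cost being argued about: every report that needs the string must
# join (or sub-query) through the lookup relation.
rows = cur.execute("""
    SELECT s.scan_id, d.diag_name
      FROM scan s JOIN diagnosis d ON s.diag_id = d.diag_id
     WHERE d.diag_name = 'lesion'
""").fetchall()
print(rows)   # -> [(10, 'lesion'), (12, 'lesion')]
```

[The integer comparison and the typo protection come for free; the join is the recurring price, which is the whole trade-off in miniature.]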
pavlov@canisius.UUCP (Greg Pavlov) (11/29/90)
In article <355@6sigma.UUCP>, zuker@6sigma.UUCP (Hunter Zuker) writes:
> Aaron Watters writes:
>
> >ALTERNATIVELY there is the phenomenon of the UNJUSTIFIED
> >NORMALIZATION.  Where we make use of an assumption such as
> >    Persons only have one address.
> >    Husband's last name determines Wife's last name.
> >    etcetera.
> >In this case the normalization may DISALLOW information
> >that a user may later want to insert into the database.  This
> >is another issue.

These are "unjustified" only in the sense that there was no justification for
the DBA/designer to use these assumptions in the first place.  The last time
I had to set up a database that pertained to family units, I could not make
any assumptions about shared last names, addresses, telephones, etc.  E.g.,
I had to take into account instances such as a family with four children
where only three of the members had the same last name (the wife and two of
the children).  The Eisenhower era ended 30 years ago.

   greg pavlov, fstrf, amherst, ny
   pavlov@stewart.fstrf.org
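[Greg's family-unit point amounts to a schema decision: make last name an attribute of the person, not of the family, so no "Eisenhower era" assumption is wired in.  A minimal sketch, with invented table names and SQLite standing in for the real DBMS:]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE family (family_id INTEGER PRIMARY KEY);
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    family_id INTEGER REFERENCES family(family_id),
    last_name TEXT NOT NULL        -- per person, NOT per family
);
""")
cur.execute("INSERT INTO family VALUES (1)")
cur.executemany("INSERT INTO person VALUES (?, 1, ?)", [
    (1, "Smith"),                   # wife
    (2, "Jones"),                   # husband kept his own name
    (3, "Smith"), (4, "Smith"),     # two children share the wife's name
    (5, "Jones"), (6, "Garcia"),    # two more do not
])

# How many distinct last names in family 1?  A question the
# "husband's name determines wife's name" design could never answer.
(n,) = cur.execute("""SELECT COUNT(DISTINCT last_name)
                        FROM person WHERE family_id = 1""").fetchone()
print(n)   # -> 3
```

[Nothing about this is exotic; it is simply refusing to promote a cultural assumption into a functional dependency.]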
cdm@gem-hy.Berkeley.EDU (Dale Cook) (11/30/90)
In article <2020@meaddata.meaddata.com>, gordon@meaddata.com (Gordon Edwards) writes:
|>
|> IMHO, unjustified normalization refers to normalization for the sake of
|> normalization.  For example, getting a design in 3NF is adequate to account
|> for almost all anomalies.  Knowing that further normalization will lead to
|> more joins to get any useful work done, I usually stop at 3NF.  Remember

If what you mean by "normalization for the sake of normalization" is
"normalization beyond the point of diminishing return", I'll buy that.  But
you can do that in going from 2NF to 3NF as well.  It all depends on
understanding your application.  Ensuring 4NF or 5NF will probably only pay
off for large, long-lived and/or extremely complex databases.

|> economies of scale, taking a design from 3NF to BCNF will probably take more
|> work than getting the design in 3NF to start with and will almost certainly
|> incur a greater performance penalty.
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is not generally true.  4NF is usually pretty subtle, and involves
improper attribution.  The first half of your sentence really says it: it
is often not worth the effort to _ensure_ 4NF.

|> In fact, most of the design courses that were developed by DBMS vendors (at
|> least that I have seen) only teach normalization to 3NF.

This is because 3NF *USUALLY* implies 4NF & 5NF.  Remember: we normalize to
maximize integrity and stability, and to minimize data redundancy.  We
denormalize _with caution_ for environmental reasons, which should be
justified by weighing the alternatives and knowing the possible impacts.
--- Dale Cook    cdm@inel.gov
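[The 2NF-to-3NF step under discussion is concrete enough to sketch.  A department's name depends transitively on the employee key via the department key, so 3NF splits it out; renaming the department then touches one row instead of many.  Table names here are invented, and SQLite stands in for the real DBMS.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
-- 3NF: the transitive fact "dept 7 is called Imaging" is stored once.
CREATE TABLE dept     (dept_id INTEGER PRIMARY KEY, dept_name TEXT NOT NULL);
CREATE TABLE employee (emp_id  INTEGER PRIMARY KEY,
                       dept_id INTEGER NOT NULL REFERENCES dept(dept_id));
""")
cur.execute("INSERT INTO dept VALUES (7, 'Imaging')")
cur.executemany("INSERT INTO employee VALUES (?, 7)", [(1,), (2,), (3,)])

# Renaming the department is one UPDATE -- no chance of the update anomaly
# that a denormalized employee(emp_id, dept_id, dept_name) table invites,
# where a missed row leaves two names for the same department.
cur.execute("UPDATE dept SET dept_name = 'Neuro-Imaging' WHERE dept_id = 7")
rows = cur.execute("""SELECT DISTINCT d.dept_name
                        FROM employee e JOIN dept d USING (dept_id)""").fetchall()
print(rows)   # -> [('Neuro-Imaging',)]
```

[The join in the final query is exactly the "more joins to get any useful work done" that Gordon weighs against the anomaly; both sides of the trade are visible in ten lines.]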
tim@ohday.sybase.com (Tim Wood) (12/01/90)
In article <33418@netnews.upenn.edu> aaron@grad1.cis.upenn.edu (Aaron Watters) writes:
>In article <1990Nov21.185907.10061@inel.gov> cdm@gem-hy.Berkeley.EDU (Dale Cook) writes:
>
>On the contrary, programmer convenience is directly related
>to the cost of maintaining and extending the database.  In
>fact you could argue that the entire subject of database systems
>is motivated by questions of programmer convenience.  (Why don't
>we just write everything from scratch directly on top of the file
>system, anyway?)

I don't agree.  For large multi-user databases, the emphasis should be on
making life easy for the end-users at the remote terminals spread throughout
the service area, e.g. bank tellers.  Making the teller's job of using the
computer easy means that transactions will be generated faster during busy
periods and the business will be more productive.  I'd say programmer
convenience is a subordinate priority to end-user convenience, because there
are many more users than programmers of large databases.

Also, programmer convenience will be determined more by the power of the
application development tools than by database features.  After all, 4GL
applications could manage that flat-file database, though data integrity
would be very exposed; users could still be spared the resulting ugliness
:-).  Database features should be more oriented toward preserving data
integrity.  Of course, there are easy and hard ways to do that.  For a
programmer, preserving referential integrity via a trigger is the hard way;
via a declaration is the easy way.  However, the choice could be hidden in
an application tool which either generated a template referential integrity
trigger and substituted specific object names, or simply generated the
referential declaration.  I feel the declarative approach is better, because
it is simpler at any level.  This makes it the product vendor's
responsibility to handle the complexity of building declarative r.i. into
the product internals.
Complexity doesn't go away; it is just moved around.  The idea is to
leverage it.

>Further, as database systems get more elaborate
>and technical the distinction between an end-user and a programmer
>blurs rapidly.

Not if it's designed properly.  As a system gets more features, it's going
to be harder to use at some level because of the need to select among the
features.  But the system should allow hiding of that complexity at
appropriate levels, so that one only need know a few simple things to do
something simple.  The larger the learning investment a user needs to make
in a system before doing the first unit of useful work, the less optimal is
that system's design.

>I would draw the alternate conclusion that
>database systems should provide facilities that reduce the
>hassles of working with a highly normalized database -- such
>as good support for complex views among other things.  It is
>not clear this is always done.  -aaron

Agreed (funny, since I disagreed with your premises :-).  That's why adding
complexity is costly -- to do it right, you have to concurrently create
schemes for hiding it.
-TW

Sybase, Inc. / 6475 Christie Ave. / Emeryville, CA / 94608        415-596-3500
tim@sybase.com          {pacbell,pyramid,sun,{uunet,ucbvax}!mtxinu}!sybase!tim
Dis claim er dat claim, what's da difference?  I'm da one doin da tawkin' hea.
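[Tim's declaration-vs-trigger comparison can be seen side by side.  The sketch below uses SQLite, which happens to support both styles (his context is Sybase; the table names here are invented): the declarative form is a single clause, while the trigger is the hand-rolled equivalent a tool would have to generate and substitute object names into.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite leaves declarative r.i. off by default
cur = conn.cursor()
cur.executescript("""
CREATE TABLE account (acct_no INTEGER PRIMARY KEY);

-- Easy way: declarative referential integrity, one clause.
CREATE TABLE deposit (
    dep_id  INTEGER PRIMARY KEY,
    acct_no INTEGER NOT NULL REFERENCES account(acct_no)
);

-- Hard way: the same rule as a template trigger on a second table.
CREATE TABLE withdrawal (wd_id INTEGER PRIMARY KEY, acct_no INTEGER NOT NULL);
CREATE TRIGGER withdrawal_ri BEFORE INSERT ON withdrawal
WHEN NOT EXISTS (SELECT 1 FROM account WHERE acct_no = NEW.acct_no)
BEGIN
    SELECT RAISE(ABORT, 'no such account');
END;
""")
cur.execute("INSERT INTO account VALUES (100)")
cur.execute("INSERT INTO deposit VALUES (1, 100)")   # valid row: accepted

outcomes = []
for table in ("deposit", "withdrawal"):
    try:
        cur.execute(f"INSERT INTO {table} VALUES (99, 999)")  # dangling account
        outcomes.append(table + " accepted bad row")
    except sqlite3.IntegrityError:
        outcomes.append(table + " rejected bad row")
print(outcomes)   # both mechanisms enforce the rule
```

[Both inserts with a dangling account number are rejected.  The enforcement is identical; the difference, which is Tim's point, is who carries the complexity: the vendor (declaration) or the programmer and every future maintainer (trigger).]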