tiberio@seismo.UUCP (04/01/87)
At our facility we have a certain large investment in a relational database design that is not in the highest normal form. We have some folks proposing that we redesign our database (and no doubt out database interface software) to a greater or highest normal form. They state we will enjoy the following qualities when this is done. 1) The ammount of space required to store the data is lowest. 2) The ability to update the database is greatest. 3) Explaining the database is simplest. While I agree with the first point, If disc space is not a problem I don't see where this is a big win. I'm not really sure under point two if they mean update the database or update the database design. We update (or replace in INGRES terms) our databases very little once they are established. I disagree with the third point, especially if I have to explain this twisted mess of relations (56 vs 24) to our mythical "joe" user. What I am asking you folks in netland is for some guidance. I like the intellectual concept of normal form relational databases, but I need to know if there are any good reasons not to go that route. I would also like to know if there are other advantages that the proposers failed to mention. I might add that we seem to have plenty of money for hardware, but that man hours are at a premium. Since this group has very little traffic, I don't see why we can't discuss this on the net, but please feel free to send me E-mail for any mega-responses. References to texts and/or papers would be welcome. beno!tiberio seismo!tiberio
roger@esquire.UUCP (04/01/87)
Since you seem to be discussing an application which already is up and running, why "improve" it? It apparently does what you need it to do, in a manner not inconsistant with your resources. I'm thinking of one of our applications which took about 8 man-years to bring up. We will not make changes in the database scheme, period. Simply because we do not care to have to go hack everything to reflect the new scheme. Rule 1: IF IT AINT BROKE, DON'T FIX IT. Rule 2: Academic excercizes are for academics. Roger Reid cmcl2!esquire!roger
lien@osu-eddie.UUCP (04/02/87)
Normal form is mainly design for update anomalies. Let's take a trivial example. If you design such a relation to store the grade of a class: SSN, HOMEWORK1, HOMEWORK2, MIDTERM, FINAL, AVE Whenever you updaate the values of HOMEWORK1, HOMEWORK2, MIDTERM, or FINAL, you have to update the values in AVE at the same time. Otherwise, you may get an inconsistent database. A good database should not assume that every user will follow the rule carefully. Thus, this database is dangeous. (Actually, a good database doesn't need AVE since it can be caluculated. This is not a good example.) The problem in this relation is that AVE is transitively dependent on the key (SSN). Normalization theory tells us that it is not in 3NF. If we can design it in 3NF or higher, this anomaly can be avoided. This is the point behind normalization theory. I don't have the complete knowledge about it. Please consult Date's book or Ullman's book to get a better picture. Yao-Nan Lien -- ------------------------------------------------------------ Yao-Nan Lien Department of Computer and Information Science Ohio State University 2036, Neil Ave. Mall Columbus, Ohio 43210-1277 Tel 614 292-5236 CSNet : lien@ohio-state.CSNET Arpa : lien@ohio-state.arpa UUCP : cbosgd!osu-eddie!lien
jeff@rtech.UUCP (04/02/87)
in article <43239@beno.seismo.CSS.GOV>, tiberio@seismo.CSS.GOV (Mike Tiberio) says: > > At our facility we have a certain large investment in a relational database > design that is not in the highest normal form. We have some folks proposing > that we redesign our database (and no doubt out database interface software) > to a greater or highest normal form. They state we will enjoy the following > qualities when this is done. > > 1) The amount of space required to store the data is lowest. This is usually true, but not necessarily so. > 2) The ability to update the database is greatest. > > 3) Explaining the database is simplest. Sometimes you can run into update problems in unnormalized databases, because a single relationship can be stored in more than one place. For example, suppose you are keeping track of parts, suppliers of parts, and orders from those suppliers, and one of your tables looks like this: Supplier | City | Part | Color | Quantity ----------------------------------------- J1 | NY | p1 | red | 100 J1 | NY | p2 | white | 200 S6 | LA | p2 | white | 50 L3 | DC | p2 | white | 300 L3 | DC | p3 | blue | 100 Now suppose the supplier J1 moves from NY to SF. You will have to update J1's city in two places. Suppose you want to keep track of a new supplier, R2, who doesn't currently supply you with anything. There will have to be a row in the table in which part, color, and quantity are blank. Suppose supplier S6 stops supplying the part p2. To delete that row from the table would erase all information about S6, which you don't necessarily want to do; instead, you must recognize the case where there is only one row left for this supplier, and blank out the part, color, and quantity for that row. The problem here is normalization. Without going into any detail on what it means, here is the above information in third normal form: Supplier Table Part Table Supplier | City Part | Color --------------- ------------ J1 | NY p1 | red S6 | LA p2 | white L3 | DC p3 | white On-order Table Supplier | Part | Quantity -------------------------- J1 | p1 | 100 J1 | p2 | 200 S6 | p2 | 50 L3 | p2 | 300 L3 | p3 | 100 Normalization is sometimes explained with a bunch of fancy jargon, but what it boils down to is this: Every table should stand for a single thing. In the example, the problem with the original table is that it stands for three things: suppliers, parts, and orders. Jamming these together into one table causes problems when we want to update some aspect of one of the things. A normalized database is easier to understand because the individual tables are simpler: each one stands for only one thing. The only thing that may become harder is figuring out how to join the various tables together; the main way to fight against this is to make sure that each of your tables stands for some real thing in the problem you're trying to solve (if one is not careful, one can come up with tables that don't model the problem, but instead model the underlying algorithm; it's a good idea to design your data first, and then figure out your algorithms). This makes it sound easier than it really is. Figuring out what your "things" are can be difficult. > What I am asking you folks in netland is for some guidance. I like the > intellectual concept of normal form relational databases, but I need to > know if there are any good reasons not to go that route. The main argument against normalization is performance. Normalization involves splitting unnormalized tables up into smaller tables. But this means you will have to join the tables to recreate the relationships in the original, unnormalized form. Joins are slower than fetches from single tables. In the above example, suppose that you often need to know which parts are coming from which cities. In the normalized database, you would have to join the supplier and the on-order tables. If this took too much time, you might decide to unnormalize your database and merge the two tables into one: Supplier-Order Table Supplier | City | Part | Quantity --------------------------------- J1 | NY | p1 | 100 J1 | NY | p2 | 200 S6 | LA | p2 | 50 L3 | DC | p2 | 300 L3 | DC | p3 | 100 Some of the update and understanding problems come back, but it will run faster for the operation we're interested in. -- Jeff Lichtman at rtech (Relational Technology, Inc.) "Saints should always be judged guilty until they are proved innocent..." {amdahl, sun}!rtech!jeff {ucbvax, decvax}!mtxinu!rtech!jeff
phgun@violet.berkeley.edu.UUCP (04/02/87)
While I agree that in most cases it's important to at least have the relations in 3NF, sometimes there's no neccessity for it. Take the example where my relation is R(NAME, PHONE#, ROOM#) where NAMEs are unique, each person has his own PHONE#, and more than one person may be assigned to the same ROOM#. R is not in 3NF, since ROOM# is transitively dependent on the key NAME (i.e. NAME->PHONE#, PHONE->ROOM#, NAME->ROOM#). By Normalization theory, we want to (non-loss) decompose R into R1(NAME, PHONE#) and R2(PHONE#, ROOM#). But this is only if there's a reason to know the assignment of phones to rooms independent of who is occupying them. Otherwise I don't see any reason why we should worry about "losing" information about the location of phones in rooms if we delete some name from R, if it is of no significance to us (e.g. we rip off the line if no one's using it!). But if we have lots of queries pertaining to "who is in room X?", or "which room is Y located in?", this decomposition means we need to always JOIN R1 and R2, which is a performance LOSS. Note that R cannot be decomposed into R1(NAME, ROOM#) and R2(ROOM#, PHONE#) because this is a lossy-join decomposition, though on the surface, they gives us a more efficient way of executing the 2 queries above (after all they're both in 3NF, but the joining-attribute is not a key in either relations). ---Himawan Gunadhi ARPA: phgun@violet.berkeley.edu UUCP: ...!ucbvax!jade!violet!phgun BITNET: phgun@ucbviolet.bitnet
billc@blia.UUCP (04/03/87)
In article <43239@beno.seismo.CSS.GOV>, tiberio@seismo.CSS.GOV (Mike Tiberio) writes: > ... that we redesign our database > to a greater or highest normal form. They state we will enjoy the following > qualities when this is done. > > 1) The ammount of space required to store the data is lowest. This is true. However, you pay a performance penalty, which I'll talk about below. It's the old space vs. speed business. > 2) The ability to update the database is greatest. I'm not at all sure what is meant by this. I don't think this makes the schema any easier to change. You can build in some better integrity checking, but I don't want to go into that now, since there are other options for those who choose non-normalized schemas (schemae?). > 3) Explaining the database is simplest. I also strongly disagree. Try to write a tutorial using a heavily-normalized schema as the example database! Normalization is a religious issue. I know a number of people who strongly favor it, and it does indeed have its place. However, I think there are some good counter-arguments for CERTAIN situations. (No, I am not anti-normalization, but I am also not a normal-form worshipper.) The biggest disadvantage is in processing time. Modelling has shown that the stronger the normalization, the poorer the overall performance of the database. More normalization means more joins. Another disadvantage is that a heavily normalized database can be difficult to tune. There are many ways of keying together two relations. SOme of them can get pretty slow. It takes more adminstrative "cleverness" to handle a heavily-normalized schema. And now, the good stuff. Don't blindly stash everything in one giant table, there are many situations where normalization pays off. A large amount of fairly redundant information that is infrequently accessed is clearly a good candidate for spinning off into its own relation. Normalization also gives richer semantics and greater flexibility -- you can't do one-to-many relationships without normalization. -- W.H.Coffin. billc@blia.BLI.COM Or, if you really like source routing, try ucbvax!{mtxinu|ucsfcgl}!blia!billc >> the usual disclaimer about my employer and my wretched opinions. << >> the usual witticisms that swell netnews to ridiculous proportions. <<
eric@hippo.UUCP (04/03/87)
Higher normal forms are good, but only in a perfect world. It is true that it will take less disk space (since any data item will only be stored once), and the inherent design of the database is "cleaner". It also makes updates easier, since the only replicated data are key values. What they are not telling you is that the penalty can be performance. Third normal form databases can quickly lead you to multi-table joins instead of simple retrieves out of a single table. Being the veteran of several projects that used relational databases (sometimes quite large and complex), I can tell you that our usual strategy is to design the database initially to third normal form (using ER diagrams, or whatever), and then start looking at where the bottlenecks might be, and where replication might be in order to achieve the necessary performance. But it is alot better to start in third normal form and work backwards (at least then you understand and document why you replicated a particular set of data), then to just shoot for a lower order of normalization to start with. -- eric ...!ptsfa!hippo!eric
allbery@ncoast.UUCP (04/04/87)
As quoted from <43239@beno.seismo.CSS.GOV> by tiberio@seismo.CSS.GOV (Mike Tiberio): +--------------- | At our facility we have a certain large investment in a relational database | design that is not in the highest normal form. We have some folks proposing >... | 3) Explaining the database is simplest. >... | While I agree with the first point, If disc space is not a problem I | don't see where this is a big win. >... | I disagree with the third point, especially if I have to explain this | twisted mess of relations (56 vs 24) to our mythical "joe" user. +--------------- It *is* easier. Our database at TDI has very little normalization, being basically unchanged from its original form as a FilePro-16 (early version) database, which was not very relational at all. In particular, the labor tracking system for factory orders is something like: order --->> lineitem \ | \ v ------> labdesc ^ | ----^ v / v routing ->> labordtl \ ^ \ ^ ------>> location One arrow is a one-to-one, two make a one-to-many. I also omit the item file which joins one-to-many with the lineitem and routing tables, and a number of other relationships. Normalizing this adds more tables, but the resulting relationships are MUCH easier to follow. (I have attempted to chart the current relationships in our application; talk about spaghetti...) One thing that must be remembered, however: Normalization is not ALWAYS a good idea. For instance, it can be argued that cities should be ``normal- ized'' onto states, ditto for zip codes, UPS zones should be normalized to ZIP codes, and area codes normalized to states; while tables expressing these relationships produce an increased normalization, they are in practice a bit ridiculous. (Hmmm, that last sentence doesn't make much sense even to me. Maybe I should try again?) A practical example: an order entry application can have a customer mailing address, a customer shipping address, and ways to override these addresses for specific orders (shipping) and invoices (mailing). Normalization says that addresses should be stored in a separate address file in all cases and pointers used from the various tables; this saves space in the tables. However, in practice at least the normal customer mailing address will be kept in the customer file rather than the address file, as it will be the most commonly used and needs to be available for ready reference. (The UNIFY 3.2 Programmer's Manual contains a very complete discussion on normalization, when to apply it, and when *not* to use it; it does a far better job than I am....) ++Brando -- ____ ______________ / \ / __ __ __ \ Brandon S. Allbery | /^\ USS ___ | /__> / \ / \ aXcess Company | A A A CHALLENGER / \ | | `--, `--, 6615 Center St. #A1-105 | H V H | | \__/ \__/ \__/ Mentor, OH 44060-4101 | H . H SEVEN \____/ \______________/ +1 216 974 9210 | / | \ WHO ________________________________________________________| /___|___\ DARED As long as there is the spirit to dare new frontiers, the dream will never die. cbatt!cwruecmp!ncoast!allbery ncoast!allbery%case.CSNET@relay.CS.NET BALLBERY (UUCP) (CSNET/Internet) (MCIMail)
vollmer@cod.UUCP (04/05/87)
in article <43239@beno.seismo.CSS.GOV>, tiberio@seismo.CSS.GOV (Mike Tiberio) says: > > At our facility we have a certain large investment in a relational database > design that is not in the highest normal form. We have some folks proposing > that we redesign our database (and no doubt out database interface software) > to a greater or highest normal form..... A real world problem I have had with going overboard with normal forms is that many of the DBMS 4GL application building tools have a real problem trying to put the whole thing back together again. You have data safely tucked away in "their own" relations and spend a lot of time trying to kludge the application tools to try to make reports and data entry sessions not look like they are aimed at 1001 tables. There are real gains from normalization, but the problem is trying to balance performance and use of the 4GL tools on the design. That balance is both the fun and the art of making relational work.
mjr@well.UUCP (04/05/87)
In article <43239@beno.seismo.CSS.GOV> tiberio@seismo.CSS.GOV (Mike Tiberio) writes: >At our facility we have a certain large investment in a relational database >design that is not in the highest normal form. We have some folks proposing >that we redesign our database (and no doubt out database interface software) >to a greater or highest normal form. They state we will enjoy the following >qualities when this is done. > *** etc. *** Normal form is better for many of the reasons you state, AND because when you update an attribute, you only have to do it ONCE as it only appears in ONE table and not in many... Also, each "thing" you are describing in the real world, while it may appear in several tables (say invoices in one table, and invoice lines in another), there is a clear cut way to distinguish the rows (tuples) that belong to a (represent) particular thing in the real world. The two main justifications for NOT normalizing are 1) Performance (something you query a hundred times a day say may take too long if parts of it are scattered across many tables), and 2) recoverability (if you loose your only copy what happens...). For an excellent discussion of these and other aspects of data-base design, see Daniel Martin's ADVANCED DATABASE TECHNIQUES. This book is about Entity-Relationship modeling generally, but it does have several chapters devoted to general issues such as those mentioned above! (that's MIT Press, 1986) Or, I happen to be writing a column on the subject of DBMSs generally in Computer Language Magazine. Feel free to write to mjr@well or ptsfa!well!mjr, etc...
mjr@well.UUCP (04/06/87)
Well Jeff, you certainly got things stirred up here... In general, I don't think that normalization is a "religious" issue, there are sound technical and logical reasons for using it. On the other hand, "if it ain't broke don't fix it" works if you are aware of the problems you can get into with unnormalized relations, and prepare to deal with them when they come up. Date has been mentioned here but here are references to his most famous work (really the best treatment of the whole of relational theory I've seen) C.J. Date "An Introduction to Database Systems" Addison-Wesley, 1986 (fourth Edition.) Vol I and also Volume II which tends to deal with more "real world" problems after the foundation is layed in Vol I... Be well, mjr@well
billc@blia.UUCP (04/14/87)
The point of my earlier posting was simply that there is no formula that can replace common sense in designing a schema. There are people who are religious about normalization in the sense that they believe in rigorous normalization no matter what the situation. I suppose it IS easier to explain a normalized database, given a high level of complexity. I was thinking, though, about tutuorial level indoctrination into query languages. I'll take a concrete example: it is very hard to explain correlated subqueries to the poor first-time SQL user. (And indeed, they are a horrible botch.) When describing correlated subqueries, I like to start with single-table examples. If the example schema is normalized, the simplest "sensible" example query may have 2 or 3 joins involved. This adds a lot more "noise", and makes the examples a good deal more difficult to comprehend. Certainly, once a user understands the query language a somewhat-normalized database can be easier to understand, but again, the normalization should be pursued with common sense rather than with formulaic rabidity :-). -- W.H.Coffin. billc@blia.BLI.COM Or, if you really like source routing, try ucbvax!{mtxinu|ucsfcgl}!blia!billc >> the usual disclaimer about my employer and my wretched opinions. << >> the usual witticisms that swell netnews to ridiculous proportions. <<
mjr@well.UUCP (Matthew Rapaport) (04/18/87)
I wasn't thinking so much about teaching someone a query language as designing a database in the first place. Knowing the logical advantages to normalization, it makes sense to begin that way, and make deliberate and reasoned decisions NOT to normalize when appropriate yours truely mjr@well