[comp.databases] Pro-Pre-Relational

chris@AUSTIN.LOCKHEED.COM (Chris Wood) (01/07/88)

Just for the heck of it, I will give you a positive reaction to 
pre-relational technology.

There are/were a number of "good things" about network/hierarchical models:

1. Navigation between record types (relations/tables) is automatic in a
hierarchical system.  That is, in a Hierarchy, there is one and only one
path between two nodes, and that path is by definition already set up.  In
a relational system, the user must do join type operations as part of the
query.
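
(To make the contrast concrete: in SQL the connection must be spelled
out in every query.  The table and column names here are just for
illustration, not from any particular system.)

	SELECT DEPT.DNAME, EMP.ENAME
	  FROM DEPT, EMP
	 WHERE DEPT.DEPT# = EMP.DEPT#

In a hierarchical system the DEPT-to-EMP path is already built into the
schema, so no join clause has to be written.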

NOTE: I have seen or heard of several "relational-like" DBMSs that allow
      "presetting" such relationships to allow software to automatically
      calculate a path between record types as part of determining how to
      satisfy a query.

2. CODASYL/Network DBMSs had a little feature called "Place Near" that
allowed related record occurrences of more than one record type to be
physically clustered, thus optimizing performance of retrievals.

How can relational implementations do this unless they know about such
relationships?

3. Many hierarchical and Network DBMSs allow repeating fields and even
repeating groups of fields.  Relational "purists" violently object to this
on the grounds that it is not "normalized".

However, consider the following scenario:
I have a General Ledger application with 4000 account entries.  I need to
keep track of 12 months worth of data for each account.  In the relational
view, I am forced to build 2 tables, a GL table with 4000 records, and a
Monthly table with 36000 records.  In a system that allows repeating fields,
I build 4000 records in a single table.  This obviously has a very large
impact on retrieval speed.  On average, it should take about 10 times as
long to examine 40000 records as 4000 records.  Of course, the records are
much shorter in the relational implementation, and repeating fields bring
the well-known maintenance headaches of variable-length records.  But still,
the relational implementation is quite inefficient.
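
For concreteness, the two relational tables would look something like
this (names and column types are made up for the example):

	CREATE TABLE GL_ACCOUNT
	     ( ACCT#        INTEGER       NOT NULL,
	       DESCRIPTION  CHAR(30) )

	CREATE TABLE GL_MONTHLY
	     ( ACCT#        INTEGER       NOT NULL,
	       MONTH#       INTEGER       NOT NULL,
	       BALANCE      DECIMAL(12,2) )

In the repeating-field version the 12 monthly balances ride along inside
the single GL_ACCOUNT record, which a strict first-normal-form system
will not allow.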

Note: a professor at the University of Texas is currently studying extending
SQL to include "imbedded relations", aka repeating groups.

----------------------------MAIN POINT--------------------------------

Not everything in hierarchical/network technology is bad.  We should learn
from both our successes as well as our mistakes.  I think that some of the
good features of these "old" technologies should be salvaged, and 
incorporated into relational implementations.


FLAME AWAY, This should be fun!

Chris Wood   ------------------------ 15+ years playing with DBMS:
                                      SYSTEM 2000, TOTAL, IDMS, MODEL 204,
                                      ORACLE, VAX RDB, BTRIEVE (IBM/PC), and
                                      even a little IMS (Ugh!), INQUIRE, etc. 

ok@quintus.UUCP (Richard A. O'Keefe) (01/08/88)

I strongly urge everyone who is interested in this debate,
and who hasn't already got a copy, to try to obtain access to a
copy of
	"Relational Database : Selected Writings",
	C. J. Date,
	Addison-Wesley, 1986
	US$ 35 (hardback)
The relevant chapter at the moment is Chapter 6:
	"Some Relational Myths Exploded".

In article <68@coot.AUSTIN.LOCKHEED.COM>,
chris@AUSTIN.LOCKHEED.COM (Chris Wood) writes:
> There are/were a number of "good things" about network/hierarchical models:
> 2. CODASYL/Network DBMSs had a little feature called "Place Near" that
> allowed related record occurrences of more than one record type to be
> physically clustered, thus optimizing performance of retrievals.
> 
> How can relational implementations do this unless they know about such
> relationships?
They can't, any more than the CODASYL ones could.
Both the DBTG stuff and the relational model distinguish between
the abstract design and the physical layout.  Putting something near
something else is a physical question.

This is Date's
    "MYTH NUMBER 6: The data must be hierarchically clustered
		    for good performance."
As Date says "what is generally overlooked when such claims are made
is that such interleaving biases the database TOWARD some applications
but AGAINST others."  (Page 89.)

It also ties in with
    "MYTH NUMBER 7: Hierarchic clustering requires pointers."
which is exploded on page 90, where he says that ORACLE actually
lets you do it.  The sketch is

	CREATE CLUSTER DEPTEMP
		DEPT# ...
	ALTER CLUSTER DEPTEMP
		ADD TABLE DEPT
		WHERE DEPT# = DEPTEMP.DEPT#
	ALTER CLUSTER DEPTEMP
		ADD TABLE EMP
		WHERE DEPT# = DEPTEMP.DEPT#

Note that this does not at all change what the relations look like
or what you can do with them, only where tuples are stored.


> 3. Many hierarchical and Network DBMSs allow repeating fields and even
> repeating groups of fields.  Relational "purist" violently object to this
> on the grounds that it is not "normalized". 
> 
> However, consider the following scenario:
> I have a General Ledger application with 4000 account entries.  I need to
> keep track of 12 months worth of data for each account.  In the relational

Actually, they will refer to "First Normal Form".
The relational *model* says that data items should be atomic *with
respect to the application*.  What this means is actually pretty
vague.  Many *implementations* say that a data item should be some
sort of number, string, or timestamp.  Basically, yes, if there is
structure that your application is interested in, it should be
explicit in the relations.

In this case, since there are 12 months in every year, why not
have one relation with 12 separate attributes over the same domain?
If some of the values are known and others aren't, that's exactly
what NULL values are for.
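
Something along these lines (the names are invented for the example):

	CREATE TABLE GL_ACCOUNT
	     ( ACCT# INTEGER NOT NULL,
	       M01 DECIMAL(12,2), M02 DECIMAL(12,2), M03 DECIMAL(12,2),
	       M04 DECIMAL(12,2), M05 DECIMAL(12,2), M06 DECIMAL(12,2),
	       M07 DECIMAL(12,2), M08 DECIMAL(12,2), M09 DECIMAL(12,2),
	       M10 DECIMAL(12,2), M11 DECIMAL(12,2), M12 DECIMAL(12,2) )

Months that haven't been posted yet simply stay NULL.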

> speed of retrieval obviously.  On average, it should take about 10 times
> as long to examine 40000 records as 4000 records.  Of course the records

Not if the 4,000 records are 10 times bigger than the 40,000.  What counts
is number of disc accesses.  Suppose that you have 4,000 records each
64 bytes long and 36,000 records each 12 bytes long, whereas if you
had been able to pack them together you'd have had 4,000 records each
112 bytes long.  Then the repeating-field version would have taken
448,000 bytes, and the "flat" version would take 688,000 bytes, a
ratio of about 1.54.  If you are examining all the information, the
slow-down would probably be about the same, a factor of 1.5, NOT a
factor of 10.  What the factor actually is depends on the layout your
particular data base system picks, of course.

But what is so special about repeating fields?  That's only warmed-over
COBOL.  Why not let me use any data structure my programming language
will support?  Let's see, arbitrary sized trees, logical variables,
arbitrary precision integers, ...  You mean COBOL doesn't do that?  Oh.
Why not let me store the triangular arrays, N-dimensional tables with
margins, experiment designs, and so on I use in GENSTAT?  You mean COBOL
doesn't do that?  Oh.  Look, you have to draw the line *somewhere*.

> ----------------------------MAIN POINT--------------------------------
> 
> Not everything in hierarchical/network technology is bad.  We should learn
> from both our successes as well as our mistakes.  I think that some of the
> good features of these "old" technologies should be salvaged, and 
> incorporated into relational implementations.
> 
Absolutely right.  (Er, what were those two successes, again?)

UH2@PSUVM.BITNET (Lee Sailer) (01/08/88)

In article <68@coot.AUSTIN.LOCKHEED.COM>, chris@AUSTIN.LOCKHEED.COM (Chris Wood) says:
>
>Just for the heck of it, I will give you a positive reaction to
>pre-relational technology.
>
>There are/were a number of "good things" about network/hierarchical models:
>
>1. Navigation between record types (relations/tables) is automatic in a
>hierarchical system.  That is, in a Hierarchy, there is one and only one
>path between two nodes, and that path is by definition already set up.  In
>a relational system, the user must do join type operations as part of the
>query.
>
     
It is only "automatic" if you want the path that was chosen by the designer.
If you want any other path, you have to twist and squirm.  IF the
application will only need one or two paths, the H is fine.  If you
need many paths, or if you MIGHT need many paths in the future, the
H is not enough.
     
When H was all there was, people naturally found all the Info Systems
that were easy to implement using the H approach.  These systems worked
then and they still work, and there is no reason to change them.
However, many procedures are very hard to do in an H system, and if
you need one of them, you're stuck.
>
>2. CODASYL/Network DBMSs had a little feature called "Place Near" that
>allowed related record occurrences of more than one record type to be
>physically clustered, thus optimizing performance of retrievals.
>
>How can relational implementations do this unless they know about such
>relationships?
>
     
Concepts like place near do not always optimize performance.  For example,
a programmer might navigate the database from the wrong entry point (from
the point of view of the "place near").  The criticism of Network Systems
is that they REQUIRE programmers and end-users to know too much about
the physical implementation before they can write efficient queries.
     
Likewise, the designed-in "place near" might seem like a good choice
when the system is implemented, but turn out to be the wrong choice when
users start accessing the database.
     
Modern design strategies try to delay the decisions about physical
implementation as long as possible.  Also, they try to provide the
capability of changing the underlying disk structures without breaking
any of the applications.  ANY DBMS must be able to collect performance
statistics at run time, so that the Database Manager can see that the
records in X tend to be accessed together with the records in Y, and see
to it that X and Y are stored on the same cylinder, or on different
disks, or whatever, to improve performance.  In fact, there is NO reason
that this monitoring and remodeling of the disk structures could not be
done automagically by the file system, relieving users of the burden
completely.
     
>3. Many hierarchical and Network DBMSs allow repeating fields and even
>repeating groups of fields.  Relational "purist" violently object to this
>on the grounds that it is not "normalized".
>
     
Nobody is forcing you to normalize your underlying relations.  In
most cases, normalization is better.  In some cases it isn't.
     
The rationale for normalization is that certain types of common mistakes
that programmers make are less likely.  If your programmers don't make
those particular mistakes, then don't normalize.
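
The classic example of such a mistake, with made-up tables: if the
customer's address is carried redundantly in every order record, then

	UPDATE ORDERS
	   SET CUST_ADDR = '42 NEW STREET'
	 WHERE ORDER# = 1001

quietly leaves the old address behind in every other order for that
customer.  A normalized design stores the address once, in a CUSTOMER
table, so that particular mistake can't happen.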
     
Case I.  Your programmers don't make those mistakes because the application
doesn't call for those particular kinds of database access.  You're
OK until the users' needs change.
     
Case II.  Your programmers are brilliant.  You're in deep trouble.
They are about to quit.
     
>----------------------------MAIN POINT--------------------------------
>
>Not everything in hierarchical/network technology is bad.  We should learn
>from both our successes as well as our mistakes.  I think that some of the
>good features of these "old" technologies should be salvaged, and
>incorporated into relational implementations.
>
     

dc@gcm.UUCP (01/09/88)

In article <68@coot.AUSTIN.LOCKHEED.COM> chris@AUSTIN.LOCKHEED.COM (Chris Wood) writes:
>
*3. Many hierarchical and Network DBMSs allow repeating fields and even
*repeating groups of fields.  Relational "purist" violently object to this
*on the grounds that it is not "normalized". 
*
*However, consider the following scenario:
*I have a General Ledger application with 4000 account entries.  I need to
*keep track of 12 months worth of data for each account.  In the relational
*view, I am forced to build 2 tables, a GL table with 4000 records, and a
*Monthly table with 36000 records.  In a system that allows repeating fields,
*I build 4000 records in a single table.  This has a very large impact on
*speed of retrieval obviously.  On average, it should take about 10 times
*as long to examine 40000 records as 4000 records.  Of course the records
*are much shorter in the relational implementation, and the known 
*maintenance headaches of variable length records is well known.  But still,
*the relational implementation is quite inefficient.

Now we know why he wrote this article:

>FLAME AWAY, This should be fun!

Please don't flame away.  
The above example isn't a repeating group.  However, it *is* true that SQL
cannot process it efficiently.  A repeating group would be using 12 records.

tim@mtxinu.UUCP (Tim Wood) (01/11/88)

In article <68@coot.AUSTIN.LOCKHEED.COM> chris@AUSTIN.LOCKHEED.COM (Chris Wood) writes:
>
>There are/were a number of "good things" about network/hierarchical models:
>
>1. Navigation between record types (relations/tables) is automatic in a
>hierarchical system.  [...]  In
>a relational system, the user must do join type operations as part of the
>query.
>
>2. [paraphrased, i lost the text in some vi buffer:]
> CODASYL offers a "place-near" option to physically cluster joining
> records ...
>
>How can relational implementations do this unless they know about such
>relationships?
>NOTE: I have seen or heard of several "relational-like" DBMSs that allow
>      "presetting" such relationships to allow software to automatically
>      calculate a path between record types as part of determining how to
>      satisfy a query.

Um, Sybase allows you to get the optimizations you speak of without
the limitations of network/hierarchical.  At a low level, it allows
the use of nonclustered index entries as data when only key values
are being accessed; so if all your search conditions are on key values
for all tables in the query (which will usually be true in a fully
normalized database), you will never do any linear scanning of rows
with different key values, and rows matching a key value are found in 
O(log(#-of-rows-in-the-table)) time.  The system's query optimizer picks
this strategy automatically; all the user needs is a good database design.
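
For example, with a declaration something like this (names are
illustrative only):

	CREATE NONCLUSTERED INDEX gl_monthly_idx
	    ON gl_monthly (acct#, month#)

a query that touches only acct# and month# can be answered entirely
from the index pages, without ever reading the data rows.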

At the SQL level, use of stored procedures allows saving of this strategy
in the DBMS, so the database is always accessed with a pre-compiled optimal
query plan.  No application need do anything special to get the best access
path, other than conform to the database schema.  If the DBA changes the
schema, procedures using it are automatically reoptimized at the next
invocation.
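
As a sketch, with invented names:

	CREATE PROCEDURE monthly_balances @acct int
	AS
	    SELECT m.month#, m.balance
	      FROM gl_monthly m
	     WHERE m.acct# = @acct

The plan for that SELECT is compiled and kept with the procedure,
rather than being rebuilt by every application that runs the query.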

>3. Many hierarchical and Network DBMSs allow repeating fields and even
>repeating groups of fields.  Relational "purist" violently object to this
>on the grounds that it is not "normalized". 
>
>However, consider the following scenario:
>I have a General Ledger application with 4000 account entries.  I need to
>keep track of 12 months worth of data for each account.  ...
>   [explanation of 1-to-N mapping of acct. records to activity records]
>In a system that allows repeating fields,
>I build 4000 records in a single table.  
>On average, it should take about 10 times
>as long to examine 40000 records as 4000 records.  

Not if you've set your schema up properly.  See above.  Strictly speaking,
repeating fields aren't "relational"; in a good DBMS with the right schema, 
the compromise is not necessary.

I agree with Chris' implication (my inference?) that relational systems
have been naive about exploiting optimal storage structures.  The
trick is to use them without sacrificing data independence, which is
a unifying (not the product :-)) concept of RDBMS.
-TW

This posting represents my opinion, not necessarily that of
Sybase, Inc. or any of its other employees.  Though not required to do
this, I want to distinguish between my view of the DBMS scene and
Sybase's.
-- 
{ihnp4!ptsfa,pyramid,{uunet,ucbvax}!mtxinu}!sybase!tim

mjr@well.UUCP (01/16/88)

>In article <68@coot.AUSTIN.LOCKHEED.COM> chris@AUSTIN.LOCKHEED.COM (Chris Wood) writes:
>>On average, it should take about 10 times
>>as long to examine 40000 records as 4000 records.  

Not true, because the second relation (with the monthly data that would
otherwise go into the repeating fields) will have very short records,
so it will be searched quickly (certainly much less than 10 times slower
than searching 4000 much longer records)...

All large-scale relational databases I know of (bearing in mind that
the best only implement 50% of the model) allow for the kind of
physical optimization you refer to...  Perhaps I shouldn't say all,
but I know this is true of Cincom's Supra and DB2, and apparently of
Sybase as well...
mjr@well