[comp.databases] Tricky database definition problem

davidm@uunet.UU.NET (David S. Masterson) (01/11/90)

I've been having a problem designing a few databases on a couple of projects
that have an inherent problem that I can't seem to normalize:

Assume that you have some entities that each have a list of properties related
to it.  For instance, a given entity X might have properties (a, b, c) whereas
another entity Y might have properties (a,b, x, y, z).  There is nothing in
the definition of properties that separates them into different types of
entities/attributes -- they are all just "properties".  Note also that each
entity can have a different list of properties both in number and value (X has
3, Y has 5).  Finally, do not assume that either the entity identifier or the
property values are representable by one attribute (in the basic problem
definition -- I don't know about contrived attributes).

My inclination would be to model this in a relational database as two tables
(one for entities and one for properties).  The two tables would relate to one
another via the EntityID (in this case, X or Y).

Given this, how would someone write an SQL query to answer a request like
"Find the entities having all of the following list of properties (...)"?  If
that turns out to be "not too tough", then how about the request "Find the
entities having all of these properties (...) or all of these properties
(...)?"  Finally, how about the request "Find the entities having any of these
properties (...) and any of these properties (...)"?

Inquiring minds want to know...  ;-)
--
===================================================================
David Masterson					Consilium, Inc.
uunet!cimshop!davidm				Mt. View, CA  94043
===================================================================
"If someone thinks they know what I said, then I didn't say it!"

may@28182.dec.com (Patrick May) (01/11/90)

In article <CIMSHOP!DAVIDM.90Jan10105028@uunet.UU.NET>, cimshop!davidm@uunet.UU.NET (David S. Masterson) writes...
>Assume that you have some entities that each have a list of properties related
>to it.  For instance, a given entity X might have properties (a, b, c) whereas
>another entity Y might have properties (a,b, x, y, z).  There is nothing in
>the definition of properties that separates them into different types of
>entities/attributes -- they are all just "properties".  Note also that each
>entity can have a different list of properties both in number and value (X has
>3, Y has 5).  Finally, do not assume that either the entity identifier or the
>property values are representable by one attribute (in the basic problem
>definition -- I don't know about contrived attributes).

What you need is an associative table between an entity table and a property
table.  Rows in the associative table would be composed of the primary key
from the entity table and the primary key of one of the associated (hence the
name) property table rows.  This approach allows you to store non-primary key
information about the properties once and once only.

>Given this, how would someone write an SQL query to answer a request like
>"Find the entities having all of the following list of properties (...)"?  If

Queries of this sort are then trivial accesses of the associative table.

davidm@uunet.UU.NET (David S. Masterson) (01/12/90)

In article <49@arkham.enet.dec.com> may@28182.dec.com (Patrick May) writes:

   What you need is an associative table between an entity table and a
   property table.  Rows in the associative table would be composed of the
   primary key from the entity table and the primary key of one of the
   associated (hence the name) property table rows.  This approach allows you
   to store non-primary key information about the properties once and once
   only.

I'm beginning to see that I left out a key ingredient to my question.  Seeing
the replies has shown me that it is a key ingredient -- I didn't realize it
before.  What I was trying to do was to arrange ("normalize"?) my database in
order to answer all possible queries in a static fashion.  That is, I didn't
want to use dynamic SQL to answer a query like "where property in <X,Y> or
property in <M,N> and ..." which I beginning to see is not possible (the query
gets even worse when property becomes multivalued).  I don't suppose there is
an SQL construct like "where <type, value> is in <<M,x>,<N,y>>" (the two where
clauses are not really related)?  Perhaps such a construct would have value?

--
===================================================================
David Masterson					Consilium, Inc.
uunet!cimshop!davidm				Mt. View, CA  94043
===================================================================
"If someone thinks they know what I said, then I didn't say it!"

davidm@uunet.UU.NET (David S. Masterson) (01/12/90)

In article <CIMSHOP!DAVIDM.90Jan11125238@uunet.UU.NET> I write:

   I'm beginning to see that I left out a key ingredient to my question.

Forgive me, but I've been looking at this some more and maybe my question
wasn't as wrong as I thought.  Let me see if I can make the problem more
concrete.

I have a database consisting of observations.  Each observation consists of a
<DataType, DataValue> pair (the observation might be water height or machine
temperature or any number of other things).  Associated with each pair is a
list of <LocationType, LocationValue> pairs that help identify where the
observation occurred.  This can be defined simply (and rather redundantly) in
one table like (with some example data):

	Observation
	DataType	DataValue	LocType		LocValue
	WaterHeight	10		Shore		South
	    "		 "		Time		12:30
	    "		 "		Observer	Bill
	WaterHeight	15		Time		12:30
	    "		 "		Observer	Fred
	WaterTemp	60		Time		1:30
	    "		 "		Observer	Tom
	WaterTemp	50		Time		2:30
	    "		 "		Observer	Gerry

With a table like this (or more normalized), can a query be written like to
find <DataType, DataValue> that are constrained by some <LocType, LocValue>
pairs.  For instance:

1.  Find the WaterHeight when Time was 12:30 and Observer was Bill.
2.  Find the WaterHeight for Observer Fred and Shore is South.
3.  Find the WaterTemp for Observer (Tom or Fred) and Time was 1:30.

Since the qualification clause is used to qualify a row in a table (either in
the database or joined in memory), none of the above rows qualify as direct
answers to the query.  Yet, this is obviously not what is wanted.  There seems
to be nothing in the definition of <LocType, LocValue> that would suggest
normalizing it into more attributes.  The application treats all the right
hand side information as just location information (lots of strings).  Any
ideas?

--
===================================================================
David Masterson					Consilium, Inc.
uunet!cimshop!davidm				Mt. View, CA  94043
===================================================================
"If someone thinks they know what I said, then I didn't say it!"

may@28182.dec.com (Patrick May) (01/13/90)

>I have a database consisting of observations.  Each observation consists of a
><DataType, DataValue> pair (the observation might be water height or machine
>temperature or any number of other things).  Associated with each pair is a
>list of <LocationType, LocationValue> pairs that help identify where the
>observation occurred.  This can be defined simply (and rather redundantly) in
>one table like (with some example data):
> 
>	Observation
>	DataType	DataValue	LocType		LocValue
>	WaterHeight	10		Shore		South
>	    "		 "		Time		12:30
>	    "		 "		Observer	Bill
>	WaterHeight	15		Time		12:30
>	    "		 "		Observer	Fred
>	WaterTemp	60		Time		1:30
>	    "		 "		Observer	Tom
>	WaterTemp	50		Time		2:30
>	    "		 "		Observer	Gerry

>With a table like this (or more normalized), can a query be written like to
>find <DataType, DataValue> that are constrained by some <LocType, LocValue>
>pairs.

Maybe I'm misunderstanding your explanation, but from the example I see no way
to relate a time to a measurement.  Your apparent goal is a database of data
collection records composed of the following:

          Observer
          Time
          Location 
          Observation Type
          Observed Value

Each Observer may be related to many Time-Location pairs and may make many
observations at that Time-Location.  I would therefore model this system using
the following tables:

     Time.Location
          Observer_ID
          Time
          Location_ID

     Observation_Data
          <foreign key from Time.Location>
          Observation_Type
          Observation_Value

     Validation Tables for Observer, Location, and Observation_Type.

There is a one-to-many relationship between Time.Location and Observation_Data.
The primary key fields in Time.Location (all three listed) may be supplemented
with a system generated surrogate key to minimize the number of key fields in
Observation_Data (although I try to avoid such constructs).

Hope this helps,

Patrick

dberg@cod.NOSC.MIL (David I. Berg) (01/14/90)

In article <CIMSHOP!DAVIDM.90Jan11152210@uunet.UU.NET>, cimshop!davidm@uunet.UU.NET (David S. Masterson) writes:
> I have a database consisting of observations.  Each observation consists of a
> <DataType, DataValue> pair (the observation might be water height or machine
> temperature or any number of other things).  Associated with each pair is a
> list of <LocationType, LocationValue> pairs that help identify where the
> observation occurred.  This can be defined simply (and rather redundantly) in
> one table like (with some example data):
> 
>     Observation
>     DataType    DataValue    LocType        LocValue
>     WaterHeight    10        Shore        South
>         "           "        Time        12:30
>         "           "        Observer    Bill
>     WaterHeight    15        Time        12:30
>         "           "        Observer    Fred
>     WaterTemp      60        Time        1:30
>         "           "        Observer    Tom
>     WaterTemp      50        Time        2:30
>         "           "        Observer    Gerry
> 
> With a table like this (or more normalized), can a query be written like to
> find <DataType, DataValue> that are constrained by some <LocType, LocValue>
> pairs.  
--
To take the Information Modelling appoach, what you have is an object type
"Observation".  Each observation can have a number of attributes, each
with its associated value.  From the schema you presented in your posting,
I conclude that you wish to keep your data base as generic as possible,
i.e. allow for a variable number of attributes for each observation, including
those which are yet to be defined.  If this isn't so, then your
schema should look like:

    Observation(Time, WaterTemp, WaterHeight, Shore, Obersver),

end of discussion..

If this is your goal, however, (and it is not an unreasonable one) then
you've started down the right track.  When you normalize the schema 
you have presented, you will be left with those attributes which are
ALWAYS a transitive function of Observation, and those atrtributes which
are a function of the particular data items recorded during the observation.
For each observation, you need an identifier.  Date/Time of the observation
would probably be your identifier. The only other attribute which is 
a transitive function of Observation is Observer (i.e. there will always
be one and only one observer for each observation). Now you create
another object type called Observation_Data. The attributes of this 
object type are <Data_Type, Data_Value>. There is a one to many
relationship between the object types Observation and Observation_Data.
To establish the logical access path between them, you must attribute
Observation_Data with the identifier of Observation.  I recommend that 
you create an Obervation Number for each observation, particularly to answer 
the queries you have posed.  The implementation of the information model, 
then, results in the following data base
schema:
    Observation(Date/Time, Observer, Observation_No)
    Observation_Data(Observation_No, Data_Type, Data_Value).


Your data base now looks like:

    Observation(12:30,Bill,1)
    Observation(12:30,Fred,2)
    Observation(1:30,Tom,3)
    Observation(2:30,Gerry,4)

    Observation_Data(1,WaterHeight,10)
    Observation_Data(1,Shore,South)
    Observation_Data(2,WaterHeight,15)
    Observation_Data(3,WaterTemp,60)
    Observation_Data(4,WaterTemp,50)

Now, to answer the following questions:

1.  Find the WaterHeight when Time was 12:30 and Observer was Bill.
2.  Find the WaterHeight for Observer Fred and Shore is South.
3.  Find the WaterTemp for Observer (Tom or Fred) and Time was 1:30.

you would:

1. SELECT Data_Value FROM Observation_Data, Observation
    WHERE Data_Type = "WaterHeight" 
      AND Date/Time = "12:30" 
      AND Observer = "Bill"
      AND Observation_Data.Observation_No = Observation.Observation_No

2. SELECT Data_Value FROM Observation_Data
    WHERE Data_type = Water_Height
      AND Observation_Data.Observation_No = 
      (SELECT Observation.Observation_No from Observation, Observation_Data
      WHERE Observer = "Fred" 
        AND Data_Type = "Shore"
        AND Data_Value = "South",
        AND Observation_Data.Observation_No = Observation.Observation_No)

3. SELECT Data_Value FROM Observation_Data, Observation
    WHERE Data_Type = "WaterTemp" 
      AND Date/Time = "1:30" 
      AND Observer IN ("Tom","Fred")
      AND Observation_Data.Observation_No = Observation.Observation_No
-- 
David I. Berg (dberg@nosc.mil)                    GENISYS Information Systems
MILNET: dberg@nosc.mil                            4250 Pacific Hwy #118 
UUCP:   {akgua decvax dcdwest ucbvax}!            San Diego, CA 92110
         sdcsvax!noscvax!dberg                    (619) 226-1122