[comp.databases] Database problem of the week

riddle@woton.UUCP (Prentiss Riddle ) (08/11/87)

[Someone was complaining that there wasn't enough traffic in this newsgroup,
so I'll take that as a green light to raise my own little application
design quandary.  Please post replies of general or theoretical interest to
comp.databases, but send anything that's not of general interest directly
to me.  Thanks.]

I've inherited responsibility for a medium-large and growing database (say
~400 Kbytes, not including index files etc.) of experimental data which needs
to be kept around and subjected to statistical analysis. 

Currently the data is stored in a single large Informix file in which each
record consists of a key and 52 floating point lab values.  The key is a
composite of a study group number, a subject number and a time. 

	+-----------+-------------+------+-------+-------+-   -+--------+
	| group no. | subject no. | time | val 1 | val 2 | ... | val 52 |
	+-----------+-------------+------+-------+-------+-   -+--------+

So far so good.  What makes me wonder if this is the best way to store the
data is that over half of the lab values are nulls (stored as an impossibly
large real number).  In some of the fields, as a matter of fact, over 95% of
the records have no data.  One reason for this is that half of the fields
contain what are called "extra" variables: since different study groups are
subjected to different experiments, the meanings of these "extra" fields
differ from study group to study group (which is *very* confusing!) and most
study groups don't need them all.  But even in the fields with fixed
meanings, measurements are not always recorded for every subject at every
time point. 

The database is still growing and is becoming large enough to be unwieldy.
It seems to me that there ought to be a logical way to take advantage of
the sparseness of the database by representing it in some other form.  I've
had three lines of thought on this, none of which has yet borne fruit:

PLAN A:  Instead of using very wide records with 52 lab values, why not use
short records each containing a key and a single lab value?  Each record
might look like this:

	+-----------+-------------+------+-------+-------+
	| group no. | subject no. | time | label | value |
	+-----------+-------------+------+-------+-------+

The "label" would be a short (3-4 byte) string explaining the meaning of the
"value" field.  Obviously most of our storage here goes for keys (and
indexes!), but it *might* still take less storage than the current scheme. 
It would also remove confusion over the meaning of the "extra" experimental
values and allow the choice of experimental variables recorded to be tailored
precisely to each study. 
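
For concreteness, the narrow-record table might be declared something like
this (generic SQL; the names and column types are just my guesses, and the
exact syntax would depend on the Informix release):

        create table labvalue
        (
            groupno   integer not null,   -- study group number
            subjectno integer not null,   -- subject within the group
            timept    integer not null,   -- time point of the measurement
            varlabel  char(4) not null,   -- short code naming the variable
            labval    float   not null    -- the measurement itself
        );

        -- one row per measurement actually taken, so the nulls
        -- simply never get stored
        create unique index labvalue_key
            on labvalue (groupno, subjectno, timept, varlabel);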

PLAN B: I'm told that the rigid structure of this problem would lend itself
well to a hierarchical database.  Unfortunately I don't know much about what
that means, and it doesn't help us much if all we have is Informix.  Is there
a simple public-domain hierarchical database out there somewhere? 

PLAN C: This may not really be a problem for a DBMS.  The kinds of queries we
tend to do are mostly simple, and we'd be perfectly happy to separate the
study groups into distinct files where they wouldn't interact at all.  What I
sometimes wish I had was a simple program for manipulating a sparse 3-D
matrix with subject on one axis, time on another and variable on a third.  It
would need a user-friendly screen-oriented interface for entering and
correcting data, and the ability to dump one- and two-dimensional slices of
the matrix into ascii files to be fed to statistics programs.  That sounds a
lot like a 3-D spreadsheet to me, but I don't know where to find such a beast
for little or no money.  Also I'm not very familiar with the internals of
spreadsheets.  What sorts of data structures do they use -- hashing, perhaps?
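
(For what it's worth, under the PLAN A layout sketched above, one of those
two-dimensional slices -- all subjects and times for a single variable in a
single study group -- would just be a query, with the rows dumped to an
ascii file by whatever unload or report facility the DBMS provides.  The
group number and variable code below are made-up examples:)

        select subjectno, timept, labval
        from   labvalue
        where  groupno  = 3           -- one study group
        and    varlabel = 'HGB'       -- one lab variable
        order  by subjectno, timept;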

Any comments, advice or discussion would be welcome.

--- Prentiss Riddle ("Aprendiz de todo, maestro de nada.")
--- Opinions expressed are not necessarily those of Shriners Burns Institute.
--- riddle@woton.UUCP  {ihnp4,harvard,seismo}!ut-sally!im4u!woton!riddle

sysop@stech.UUCP (Jan Harrington) (08/13/87)

in article <923@woton.UUCP>, riddle@woton.UUCP (Prentiss Riddle ) says:
> 
> Currently the data is stored in a single large Informix file in which each
> record consists of a key and 52 floating point lab values.  The key is a
> composite of a study group number, a subject number and a time. 
> 

What you describe is a classic relational database design problem.  Whoever
designed the database thought of it as if he or she were using a flat file
manager, not a database management system.  Your PLAN A looks like the best
solution.  You might want to do the following:

	table 1: Subjects (subject number, name, phone, age, sex, etc. -
		any data that don't change from one experimental session
		to another)

	table 2: Groups (group number, description, etc. - any data that don't
		change from one experimental session to another)

	table 3: Experimental data (subject number, group number, date, time,
		value id, value)

If subjects remain with the same group throughout their participation in
the experiment, then the group number can be part of table 1 and isn't needed
in table 3.
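
A rough sketch of those three tables in generic SQL (column names and types
are only illustrative, and I've spelled table 2 "studygroups" rather than
"groups" to stay clear of reserved-word trouble):

        create table subjects
        (
            subjectno integer not null,   -- table 1: data fixed per subject
            name      char(30),
            age       smallint,
            sex       char(1)
        );

        create table studygroups
        (
            groupno     integer not null, -- table 2: data fixed per group
            description char(60)
        );

        create table expdata
        (
            subjectno integer not null,   -- table 3: one row per actual
            groupno   integer not null,   --          measurement
            expdate   date,
            exptime   integer,
            valueid   char(4) not null,   -- which lab variable this row holds
            labval    float   not null
        );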

Because you have so many nulls, this type of design will save you a great
deal of disk space, since rows in table 3 will only be added when actual
data exist.  You will find that this is more flexible in terms of retrieval
than a hierarchical or network design, as well.

Jan Harrington, sysop
Scholastech Telecommunications
seismo!husc6!amcad!stech!sysop

allbery@ncoast.UUCP (08/16/87)

As quoted from <923@woton.UUCP> by riddle@woton.UUCP (Prentiss Riddle ):
+---------------
| I've inherited responsibility for a medium-large and growing database (say
| ~400 Kbytes, not including index files etc.) of experimental data which needs
| to be kept around and subjected to statistical analysis. 
| 
| Currently the data is stored in a single large Informix file in which each
| record consists of a key and 52 floating point lab values.  The key is a
| composite of a study group number, a subject number and a time. 
| 
| 	+-----------+-------------+------+-------+-------+-   -+--------+
| 	| group no. | subject no. | time | val 1 | val 2 | ... | val 52 |
| 	+-----------+-------------+------+-------+-------+-   -+--------+
+---------------

I'd use two tables (files):

test
  test-id primary
  group-no
  subject-no
  time

result
  test-id
  result-no
  result-id primary composite (test-id result-no)
  result-val
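
In SQL terms that might look like the following (underscores in place of the
hyphens, "time" written as test_time in case TIME is reserved on your system,
and the group/subject numbers are made-up examples); pulling back everything
measured in one test is then a simple join:

  select t.group_no, t.subject_no, t.test_time, r.result_no, r.result_val
  from   test t, result r
  where  r.test_id    = t.test_id
  and    t.group_no   = 3
  and    t.subject_no = 17
  order  by t.test_time, r.result_no;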

+---------------
| PLAN C: This may not really be a problem for a DBMS.  The kinds of queries we
| tend to do are mostly simple, and we'd be perfectly happy to separate the
| study groups into distinct files where they wouldn't interact at all.  What I
| sometimes wish I had was a simple program for manipulating a sparse 3-D
| matrix with subject on one axis, time on another and variable on a third.  It
| would need a user-friendly screen-oriented interface for entering and
| correcting data, and the ability to dump one- and two-dimensional slices of
| the matrix into ascii files to be fed to statistics programs.  That sounds a
| lot like a 3-D spreadsheet to me, but I don't know where to find such a beast
| for little or no money.  Also I'm not very familiar with the internals of
| spreadsheets.  What sorts of data structures do they use -- hashing, perhaps?
+---------------

Well, if you used PCs, you could use Qubecalc:  a shareware 3-D Lotus-lookalike
written in Turbo Pascal.
-- 
 Brandon S. Allbery, moderator of comp.sources.misc and comp.binaries.ibm.pc
  {{harvard,mit-eddie}!necntc,well!hoptoad,sun!mandrill!hal}!ncoast!allbery
ARPA: necntc!ncoast!allbery@harvard.harvard.edu  Fido: 157/502  MCI: BALLBERY
   <<ncoast Public Access UNIX: +1 216 781 6201 24hrs. 300/1200/2400 baud>>
** Site "cwruecmp" is changing its name to "mandrill".  Please re-address **
*** all mail to ncoast to pass through "mandrill" instead of "cwruecmp". ***

garyp@cognos.uucp (Gary Puckering) (08/21/87)

In article <923@woton.UUCP> riddle@woton.UUCP (Prentiss Riddle ) writes:
>Currently the data is stored in a single large Informix file in which each
>record consists of a key and 52 floating point lab values.  The key is a
>composite of a study group number, a subject number and a time. 
>
>	+-----------+-------------+------+-------+-------+-   -+--------+
>	| group no. | subject no. | time | val 1 | val 2 | ... | val 52 |
>	+-----------+-------------+------+-------+-------+-   -+--------+
>
>So far so good.  What makes me wonder if this is the best way to store the
>data is that over half of the lab values are nulls (stored as an impossibly
>large real number).  In some of the fields, as a matter of fact, over 95% of
>the records have no data.  One reason for this is that half of the fields
>contain what are called "extra" variables: since different study groups are
>subjected to different experiments, the meanings of these "extra" fields
>differ from study group to study group (which is *very* confusing!) and most
>study groups don't need them all.  But even in the fields with fixed
>meanings, measurements are not always recorded for every subject at every
>time point. 

One technique you could use is to try to classify the study groups into
different subclasses each with its own unique set of properties.  For
example:

  experiment(expId,groupNo,subjectNo,time) 

  animalExperiment(animalExpId,whiteCellCount,respirationRate,basalTemp)

  birdExperiment(birdExpId,wingBeatRate)

  plantExperiment(plantExpId,transpirationRate,soil_ph)

In this example, I've introduced a surrogate key, expId, to substitute for
the compound key (groupNo,subjectNo,time).  This surrogate can simply
be a unique number generated whenever a new experiment is inserted.
In addition, you will probably want to define a unique index on the
columns (groupNo,subjectNo,time).
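
In SQL that might look something like this (the types are guesses, the
"time" column is renamed expTime in case TIME is reserved, and a serial or
auto-increment type, if your system has one, could stand in for the
hand-generated expId):

  create table experiment
  (
    expId     integer not null,   -- surrogate key, generated on insert
    groupNo   integer not null,
    subjectNo integer not null,
    expTime   integer not null
  );

  create unique index exp_pk  on experiment (expId);
  create unique index exp_alt on experiment (groupNo, subjectNo, expTime);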

The other relations are specializations of the experiment relation.
In fact, their primary keys (animalExpId, birdExpId and plantExpId)
are defined on the same domain as expId.  If an experiment is an
animal experiment, then you will insert a row into both the experiment
relation and the animalExperiment relation.  The expId and the
animalExpId will have the same value.  If it is a bird
experiment (a bird is an animal) then you'll insert a row into the
experiment relation, the animalExperiment relation and the
birdExperiment relation.  Again, the id's will have the same value.

To access your data, you can join these relations in various ways.
For example:

  select *
  from plantExperiment p, experiment e
  where p.plantExpId = e.expId

will get you all the plant experiment information.

Using this technique, you can probably eliminate the use of nulls to
represent inapplicable properties (when you have this case, you
probably have a new subclass).  This leaves the null feature free to
be used to represent an unknown or missing value.
-- 

Gary Puckering        3755 Riverside Dr.
Cognos Incorporated   Ottawa, Ontario    {allegra,decvax,ihnp4,linus,pyramid}
(613) 738-1440        CANADA  K1G 3N3    !utzoo!dciem!nrcaer!cognos!garyp