riddle@woton.UUCP (Prentiss Riddle ) (08/11/87)
[Someone was complaining that there wasn't enough traffic in this newsgroup, so I'll take that as a green light to raise my own little application design quandary. Please post replies of general or theoretical interest to comp.databases, but send anything that's not of general interest directly to me. Thanks.] I've inherited responsibility for a medium-large and growing database (say ~400 Kbytes, not including index files etc.) of experimental data which needs to be kept around and subjected to statistical analysis. Currently the data is stored in a single large Informix file in which each record consists of a key and 52 floating point lab values. The key is a composite of a study group number, a subject number and a time. +-----------+-------------+------+-------+-------+- -+--------+ | group no. | subject no. | time | val 1 | val 2 | ... | val 52 | +-----------+-------------+------+-------+-------+- -+--------+ So far so good. What makes me wonder if this is the best way to store the data is that over half of the lab values are nulls (stored as an impossibly large real number). In some of the fields, as a matter of fact, over 95% of the records have no data. One reason for this is that half of the fields contain what are called "extra" variables: since different study groups are subjected to different experiments, the meanings of these "extra" fields differ from study group to study group (which is *very* confusing!) and most study groups don't need them all. But even in the fields with fixed meanings, measurements are not always recorded for every subject at every time point. The database is still growing and is becoming large enough to be unwieldy. It seems to me that there ought to be a logical way to take advantage of the sparseness of the database by representing it in some other form. I've had three lines of thought on this, none of which has yet borne fruit: PLAN A: Instead of using very wide records with 52 lab values, why not use short records each containing a key and a single lab value? Each record might look like this: +-----------+-------------+------+-------+-------+ | group no. | subject no. | time | label | value | +-----------+-------------+------+-------+-------+ The "label" would be a short (3-4 byte) string explaining the meaning of the "value" field. Obviously most of our storage here goes for keys (and indexes!), but it *might* still take less storage than the current scheme. It would also remove confusion over the meaning of the "extra" experimental values and allow the choice of experimental variables recorded to be tailored precisely to each study. PLAN B: I'm told that the rigid structure of this problem would lend itself well to a hierarchical database. Unfortunately I don't know much about what that means, and it doesn't help us much if all we have is Informix. Is there a simple public-domain hierarchical database out there somewhere? PLAN C: This may not really be a problem for a DBMS. The kinds of queries we tend to do are mostly simple, and we'd be perfectly happy to separate the study groups into distinct files where they wouldn't interact at all. What I sometimes wish I had was a simple program for manipulating a sparse 3-D matrix with subject on one axis, time on another and variable on a third. It would need a user-friendly screen-oriented interface for entering and correcting data, and the ability to dump one- and two-dimensional slices of the matrix into ascii files to be fed to statistics programs. That sounds a lot like a 3-D spreadsheet to me, but I don't know where to find such a beast for little or no money. Also I'm not very familiar with the internals of spreadsheets. What sorts of data structures do they use -- hashing, perhaps? Any comments, advice or discussion would be welcome. --- Prentiss Riddle ("Aprendiz de todo, maestro de nada.") --- Opinions expressed are not necessarily those of Shriners Burns Institute. --- riddle@woton.UUCP {ihnp4,harvard,seismo}!ut-sally!im4u!woton!riddle
sysop@stech.UUCP (Jan Harrington) (08/13/87)
in article <923@woton.UUCP>, riddle@woton.UUCP (Prentiss Riddle ) says: > Quandary: "Bob" > > Currently the data is stored in a single large Informix file in which each > record consists of a key and 52 floating point lab values. The key is a > composite of a study group number, a subject number and a time. > What you describe is a classic relational database design problem. Whoever designed the database thought of it as if he or she were using a flat file manager, not a database management system. Your PLAN A looks like the best solution. You might want to do the following: table 1: Subjects (subject number, name, phone, age, sex, etc. - any data that don't change from one experimental session to another) table 2: Groups (group number, description, etc. - any data that don't change from one experimental session to another) table 3: Experimental data (subject number, group number, date, time, value id, value) If subjects remain with the same group throughout their participation in the experiment, then the group number can be part of table 1 and isn't needed in table 3. Because you have so many nulls, this type of design will save you a great deal of disk space, since rows in table 3 will only be added when actual data exist. You will find that this is more flexible in terms of retrieval than a hierarchical or network design, as well. Jan Harrington, sysop Scholastech Telecommunications seismo!husc6!amcad!stech!sysop
allbery@ncoast.UUCP (08/16/87)
As quoted from <923@woton.UUCP> by riddle@woton.UUCP (Prentiss Riddle ): +--------------- | I've inherited responsibility for a medium-large and growing database (say | ~400 Kbytes, not including index files etc.) of experimental data which needs | to be kept around and subjected to statistical analysis. | | Currently the data is stored in a single large Informix file in which each | record consists of a key and 52 floating point lab values. The key is a | composite of a study group number, a subject number and a time. | | +-----------+-------------+------+-------+-------+- -+--------+ | | group no. | subject no. | time | val 1 | val 2 | ... | val 52 | | +-----------+-------------+------+-------+-------+- -+--------+ +--------------- I'd use two tables (files): test test-id primary group-no subject-no time result test-id result-no result-id primary composite (test-id result-no) result-val +--------------- | PLAN C: This may not really be a problem for a DBMS. The kinds of queries we | tend to do are mostly simple, and we'd be perfectly happy to separate the | study groups into distinct files where they wouldn't interact at all. What I | sometimes wish I had was a simple program for manipulating a sparse 3-D | matrix with subject on one axis, time on another and variable on a third. It | would need a user-friendly screen-oriented interface for entering and | correcting data, and the ability to dump one- and two-dimensional slices of | the matrix into ascii files to be fed to statistics programs. That sounds a | lot like a 3-D spreadsheet to me, but I don't know where to find such a beast | for little or no money. Also I'm not very familiar with the internals of | spreadsheets. What sorts of data structures do they use -- hashing, perhaps? +--------------- Well, if you used PCs, you could use Qubecalc: a shareware 3-D Lotus-lookalike written in Turbo Pascal. -- Brandon S. Allbery, moderator of comp.sources.misc and comp.binaries.ibm.pc {{harvard,mit-eddie}!necntc,well!hoptoad,sun!mandrill!hal}!ncoast!allbery ARPA: necntc!ncoast!allbery@harvard.harvard.edu Fido: 157/502 MCI: BALLBERY <<ncoast Public Access UNIX: +1 216 781 6201 24hrs. 300/1200/2400 baud>> ** Site "cwruecmp" is changing its name to "mandrill". Please re-address ** *** all mail to ncoast to pass through "mandrill" instead of "cwruecmp". ***
garyp@cognos.uucp (Gary Puckering) (08/21/87)
In article <923@woton.UUCP> riddle@woton.UUCP (Prentiss Riddle ) writes: >Currently the data is stored in a single large Informix file in which each >record consists of a key and 52 floating point lab values. The key is a >composite of a study group number, a subject number and a time. > > +-----------+-------------+------+-------+-------+- -+--------+ > | group no. | subject no. | time | val 1 | val 2 | ... | val 52 | > +-----------+-------------+------+-------+-------+- -+--------+ > >So far so good. What makes me wonder if this is the best way to store the >data is that over half of the lab values are nulls (stored as an impossibly >large real number). In some of the fields, as a matter of fact, over 95% of >the records have no data. One reason for this is that half of the fields >contain what are called "extra" variables: since different study groups are >subjected to different experiments, the meanings of these "extra" fields >differ from study group to study group (which is *very* confusing!) and most >study groups don't need them all. But even in the fields with fixed >meanings, measurements are not always recorded for every subject at every >time point. One technique you could use is to try to classify the study groups into different subclasses each with its own unique set of properties. For example: experiment(expId,groupNo,subjectNo,time) animalExperiment(animalExpId,whiteCellCount,respirationRate,basalTemp) birdExperiment(birdExpId,wingBeatRate) plantExperiment(plantExpId,transpirationRate,soil_ph) In this example, Ive introduced a surrogate key, expId, to substitute for the compound key (groupNo,subjectNo,time). This surrogate can simply be a unique number generated whenever a new experiment is inserted. In addition, you will probably want to define a unique index on the columns (groupNo,subjectNo,time). The other relations are specializations of the experiment relation. In fact, their primary keys (animalExpId, birdExpId and plantExpId) are defined on the same domain as expId. If an experiment is an animal experiment, then you will insert a row into both the experiment relation and the animalExperiment relation. The expId and the animalExpId will have the same value. If it is a bird experiment (a bird is an animal) then you'll insert a row into the experiment relation, the animalExperiment relation and the birdExperiment relation. Again, the id's will have the same value. To access your, data, you can join these relations in various ways. For example: select * from planExperiment p, experiment e where p.plantExpId = e.expId will get you all the plant experiment information. Using this technique, you can probably eliminate the use of nulls to represent inapplicable properties (when you have this case, you probably have a new subclass). This leaves the null feature free to be used to represent an unknown or missing value. -- Gary Puckering 3755 Riverside Dr. Cognos Incorporated Ottawa, Ontario {allegra,decvax,ihnp4,linus,pyramid} (613) 738-1440 CANADA K1G 3N3 !utzoo!dciem!nrcaer!cognos!garyp