[comp.databases] Real join data needed

cbwalton@cs.utexas.edu (Chris Walton) (02/02/90)

Do you administer a medium-size or larger database? 
Do typical queries on that database include one or more joins? 
Would you like to help in a research project?

If your answer to all 3 questions is "YES", read on:

Most academic work on parallel join algorithms assumes data are
uniformly distributed: all join key values occur equally often and
tuples are evenly spread across all nodes of a
multi-computer system. In practice, both assumptions are false. For my
Ph.D. research, I am attempting to develop more accurate models of
join performance in the face of data skew. However, any such theory must
be validated against real data. Thus, I'm asking 'comp.databases' readers
for help in collecting profiles of real-world joins.

Essentially, I need to collect histograms of number of tuples vs. key
value. Ideally, there would be histograms for both inputs (inner and
outer relations) and join outputs. Note that the goal is to characterize
the data itself, not how frequently it's accessed or how it's stored.
Contributions from any and all application areas are welcomed -- the
more variety the better. Data can be disguised to protect privacy.
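
As a rough illustration of such a profile, the Python sketch below counts
tuples per join key value for two input relations and derives the output
histogram for an equijoin, where each key contributes (outer count) x
(inner count) result tuples. The relation names, sample rows, and helper
functions are hypothetical and chosen only for illustration; a real
profile would more likely come from a GROUP BY on the join column.

```python
from collections import Counter

def key_histogram(tuples, key_index):
    """Count how many tuples carry each join key value."""
    return Counter(row[key_index] for row in tuples)

def join_output_histogram(outer_hist, inner_hist):
    """Output tuples per key for an equijoin: outer count times inner count.

    Keys present in only one input produce no output and are omitted.
    """
    return {k: outer_hist[k] * inner_hist[k]
            for k in outer_hist if k in inner_hist}

# Hypothetical sample relations (not real data).
orders = [(1, "acme"), (2, "acme"), (3, "zenith"), (4, "acme")]   # outer
customers = [("acme", "TX"), ("zenith", "CA"), ("nadir", "NY")]   # inner

outer_hist = key_histogram(orders, 1)      # join key is column 1
inner_hist = key_histogram(customers, 0)   # join key is column 0
output_hist = join_output_histogram(outer_hist, inner_hist)

for key in sorted(output_hist):
    print(key, outer_hist[key], inner_hist[key], output_hist[key])
```

In this toy case the key "acme" accounts for three of the four outer
tuples, which is the kind of skew the histograms are meant to expose.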

As this is an academic study, I can only offer the chance to aid
a worthy cause [my graduation :-) ] and acknowledgement of your
contribution in any publications that might arise from this work.
If there is sufficient interest, I will summarize results to this
group.

If you have questions, suggestions, or think you can help, please send
e-mail to <cbwalton@cs.utexas.edu>. Thank you for your help!

Chris Walton
Department of Computer Sciences
University of Texas at Austin
Austin, TX 78741
Telephone: 512-471-9585