cbwalton@cs.utexas.edu (Chris Walton) (02/02/90)
Do you administer a medium-size or larger database? Do typical queries on that database include one or more joins? Would you like to help in a research project? If your answer to all 3 questions is "YES", read on:

Most academic work on parallel join algorithms assumes data are uniformly distributed: all join key values occur equally often, and tuples are evenly spread across all nodes of a multi-computer system. In practice, both assumptions are false. For my Ph.D. research, I am attempting to develop more accurate models of join performance in the face of data skew. However, any such theory must be validated against real data.

Thus, I'm asking 'comp.databases' readers for help in collecting profiles of real-world joins. Essentially, I need to collect histograms of number of tuples vs. key value. Ideally, there would be histograms for both inputs (inner and outer relations) and for the join output. Note that the goal is to characterize the data itself, not how frequently it's accessed or how it's stored. Contributions from any and all application areas are welcome -- the more variety the better. Data can be disguised to protect privacy.

As this is an academic study, I can only offer the chance to aid a worthy cause [my graduation :-) ] and acknowledgement of your contribution in any publications that might arise from this work. If there is sufficient interest, I will summarize results to this group.

If you have questions, suggestions, or think you can help, please send e-mail to <cbwalton@cs.utexas.edu>.

Thank you for your help!

Chris Walton
Department of Computer Sciences
University of Texas at Austin
Austin, TX 78741
Telephone: 512-471-9585
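
P.S. To make the request for histograms of number of tuples vs. key value more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the relations are tiny in-memory lists of tuples, the join is an equijoin on a single key column, and the names `key_histogram` and `join_output_histogram` are made up for this example. A real profile would more likely come from a GROUP BY query against your own DBMS.

```python
from collections import Counter

def key_histogram(tuples, key_index):
    """Count how many tuples carry each join key value."""
    return Counter(t[key_index] for t in tuples)

def join_output_histogram(inner_hist, outer_hist):
    """Histogram of equijoin output tuples per key value.

    Assumes an equijoin: a key value k that appears in both inputs
    contributes inner_hist[k] * outer_hist[k] output tuples.
    """
    return {k: inner_hist[k] * outer_hist[k]
            for k in inner_hist if k in outer_hist}

# Hypothetical toy relations; column 0 is the join key.
inner = [(1, "a"), (1, "b"), (2, "c")]
outer = [(1, "x"), (2, "y"), (2, "z"), (3, "w")]

inner_hist = key_histogram(inner, 0)
outer_hist = key_histogram(outer, 0)
output_hist = join_output_histogram(inner_hist, outer_hist)

for name, hist in [("inner", inner_hist), ("outer", outer_hist),
                   ("output", output_hist)]:
    print(name, sorted(hist.items()))
```

Only the counts per key value matter here, so the key values themselves can be replaced by arbitrary labels if the data need to be disguised.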