felicia@miller.cs.uwm.edu (05/15/91)
I thought I understood SQL's group by -having clauses, but they don't work like I think.... I hope someone would explain to me why the following SQL queries have such results: ) Assuming table created is: ========================== Name: pmtbl ----------------------------------------- pn char(8) pp char(8) txt char(25) .. .. ----------------------------------------- And, some rows existed in table are: ----------------------------------------- ' 004', NULL, 'Row p004' ' 001', NULL, 'Row p001' ' 005', NULL, 'Row p005' ' 001', NULL, 'Row p001' ' 001', NULL, 'Row p001-3' ------------------------------------------- Qry 1: select pn from pmtbl group by pn; ::-> ------------------------------------------- 001 004 005 (total of 3 rows) Qry 2: select pn, txt from pmtbl group by pn; ::-> ------------------------------------------------ 001, Row p001 001, Row p001 001, Row p001-3 004, Row p004 005, Row p005 (total of 5 rows) Qry 3: select pn, txt from pmtbl group by pn, txt; ::-> ------------------------------------------------ 001, Row p001 001, Row p001 001, Row p001-3 004, Row p004 005, Row p005 (total of 5 rows) QUESTIONS: ========== q1) Why do Qry1 and Qry2 have different results? q2) If in Qry 3, selected only pn column, should the result be the same? q3) Does the clause,'group by Col1, Col2' cause the query to retrieve only the unique row of (Col1 + Col2) ? q4) Comparing these two queries: select <col1>, <col2> from <tbl1> where ( <conditions C1> ) group by <col3> and select <col1>, <col2> from <tbl1> group by <col3> having ( <conditions C1> ) Should one of these queries always retrieve faster than the
jfr@locus.com (Jon Rosen) (05/16/91)
In article <12146@uwm.edu> felicia@miller.cs.uwm.edu writes: > >I thought I understood SQL's group by -having clauses, but they don't >work like I think.... But they do work like they do ;-) >I hope someone would explain to me why the following SQL queries have >such results: ) I will try... However, this will depend greatly on what SQL-based data base you are using. As you will see, I believe that some of these examples you have given are in fact not correct, at least according to ANSI, DB2 and most other SQLs I have used. >Assuming table created is: >Name: pmtbl >----------------------------------------- >pn char(8) >pp char(8) >txt char(25) >.. >----------------------------------------- >And, some rows existed in table are: >----------------------------------------- >' 004', NULL, 'Row p004' >' 001', NULL, 'Row p001' >' 005', NULL, 'Row p005' >' 001', NULL, 'Row p001' >' 001', NULL, 'Row p001-3' >------------------------------------------- >Qry 1: > select pn from pmtbl group by pn; >------------------------------------------- > 001 > 004 > 005 This is correct. The group by in this case acts just like a query "select distinct pn from pmtbl" which by the way may actually execute faster >Qry 2: > select pn, txt from pmtbl group by pn; >------------------------------------------------ > 001, Row p001 > 001, Row p001 > 001, Row p001-3 > 004, Row p004 > 005, Row p005 This is, in most systems, syntactically incorrect. If it is permitted, then the group by clause must be ignored (or it is erroneously being used solely as a replacement for order by). According to ANSI SQL, the use of a group by clause requires that all columns used in the select clause must be single-valued... either columns in the GROUP BY, constants or aggregates (i.e., SUM, MAX, MIN, COUNT, etc). This is because the purpose of the group by clause is to collapse all rows in the query that are common to a set of group by clause columns into a single row. If all you are trying to do here is sort the columns, use ORDER BY, not GROUP BY. >Qry 3: > select pn, txt from pmtbl group by pn, txt; >------------------------------------------------ > 001, Row p001 > 001, Row p001 > 001, Row p001-3 > 004, Row p004 > 005, Row p005 This example is consistent with Qry 1 since there are no columns that are in the SELECT that are not in the GROUP BY. Again, unnecessary and SELECT DISTINCT would work as well (and probably faster). >QUESTIONS: >========== >q1) Why do Qry1 and Qry2 have different results? Explained above. The group by only works if you have aggregates and it probably should be flagged as a syntax error. >q2) If in Qry 3, selected only pn column, should the result be the same? No. The results would be 001, 001, 001, 004, 005... I.e., the pn column would be the same, the txt column would be dropped. >q3) Does the clause,'group by Col1, Col2' cause the query to retrieve only > the unique row of (Col1 + Col2) ? GROUP BY, as noted before, causes each set of values to be considered as a group. ANSI states that all columns in the SELECT must then be "single-valued" with respect to the group. See above for an explanation. >q4) Comparing these two queries: > select <col1>, <col2> from <tbl1> where ( <conditions C1> ) > group by <col3> > > and > > select <col1>, <col2> from <tbl1> > group by <col3> > having ( <conditions C1> ) > This is truly confusing. The conditions in the first example are executed against each single row of the table. The conditions in the second example with the HAVING are executed against the rows in the grouped result table after the group by clause has been applied. Usually, HAVING is only used to compare aggregates: SELECT REGION, SUM(SALES) FROM TBL GROUP BY REGION HAVING SUM(SALES)>10000 Note that this this is ENTIRELY different from: SELECT REGION, SUM(SALES) FROM TBL WHERE SALES > 10000 GROUP BY REGION The first query gives you a list of regions (and their total sales) where the total sales is more than 10000. The second query gives you a list of regions with (lets say) offices (the individual rows) having sales greater than 10000 and the total sales of those offices. I believe that your confusion is caused by the use of GROUP BY to do sorting, not grouping. ORDER BY merely sorts the rows before returning them. GROUP BY does much more. PS, in some systems, GROUP BY without an ORDER BY may NOT actually order the rows the way you expect. In most systems, the GROUP BY is done after actually executing a sort, but on some systems (notably Teradata's parallel processing data base computer), GROUP BYs in some case may be executed in parallel and then the resulting grouped rows merged together in random order in the abscence of an ORDER BY. Note that this is not at all a violation of the SQL definition. GROUP BY is NEVER specified to cause actually sorting, even though the need to group usually engenders the need to sort and then, once the rows are sorted, it makes no particular purpose to intentionally randomize them before returning them. So most single-processer DBMSs actually do (as an artifact) cause GROUP BYs to appear sorted. I hope this has helped. If not, please post again or e-mail and I will try to give you furhter info. Jon Rosen
hansene@hpcc01.HP.COM (Ed Hansen) (05/17/91)
I tried executing your queries in Ingres. Query 1 works as you have indicated; Query 2 gives an error saying that ' The columns in the SELECT clause must be contained in the GROUP BY clause.'; Query 3 has four rows in the output; +--------+-------------------------+ |pn |txt | +--------+-------------------------+ | 001|Row p0001-3 | | 001|Row p001 | | 004|Row p004 | | 005|Row p005 | +--------+-------------------------+ (4 rows) It seems that what you have is a different implementation and it appears that your implementation gives some unusual results. I wonder what the ANSI standard prescribes. JeanMarie(jeanmar@hpcc31.corp.hp.com)