[comp.databases] Help::: SQL group by/having... Questions!!!

felicia@miller.cs.uwm.edu (05/15/91)

I thought I understood SQL's group by -having clauses, but they don't
work like I think....

I hope someone would explain to me why the following SQL queries have
such results: )


Assuming table created is:
==========================
Name: pmtbl   
-----------------------------------------
pn      char(8)
pp      char(8)
txt     char(25)
..
..
-----------------------------------------

And, some rows existed in table are:
-----------------------------------------
'    004', NULL, 'Row p004'
'    001', NULL, 'Row p001'
'    005', NULL, 'Row p005'
'    001', NULL, 'Row p001'
'    001', NULL, 'Row p001-3'

-------------------------------------------


Qry 1: 
      select pn from pmtbl group by pn;
::->
-------------------------------------------
    001
    004
    005
 
(total of 3 rows)

Qry 2: 
      select pn, txt from pmtbl group by pn;
::->
------------------------------------------------
    001, Row p001
    001, Row p001
    001, Row p001-3
    004, Row p004
    005, Row p005

(total of 5 rows)

Qry 3: 
      select pn, txt from pmtbl group by pn, txt;
::->
------------------------------------------------
    001, Row p001
    001, Row p001
    001, Row p001-3
    004, Row p004
    005, Row p005

(total of 5 rows)


QUESTIONS:
==========

q1) Why do Qry1 and Qry2 have different results?

q2) If in Qry 3, selected only pn column, should the result be the same?

q3) Does the clause,'group by Col1, Col2' cause the query to retrieve only
    the unique row of (Col1 + Col2) ?

q4) Comparing these two queries:
     select <col1>, <col2> from <tbl1> where ( <conditions C1> )
            group by <col3>

    and
 
     select <col1>, <col2> from <tbl1> 
            group by <col3>
            having ( <conditions C1> )

    Should one of these queries always retrieve faster than the 

jfr@locus.com (Jon Rosen) (05/16/91)

In article <12146@uwm.edu> felicia@miller.cs.uwm.edu writes:
>
>I thought I understood SQL's group by -having clauses, but they don't
>work like I think....

  But they do work like they do ;-)

>I hope someone would explain to me why the following SQL queries have
>such results: )
 
  I will try...     
 
  However, this will depend greatly on what SQL-based data base you
  are using.  As you will see, I believe that some of these examples
  you have given are in fact not correct, at least according to ANSI,
  DB2 and most other SQLs I have used.  

>Assuming table created is:
>Name: pmtbl   
>-----------------------------------------
>pn      char(8)
>pp      char(8)
>txt     char(25)
>..
>-----------------------------------------
>And, some rows existed in table are:
>-----------------------------------------
>'    004', NULL, 'Row p004'
>'    001', NULL, 'Row p001'
>'    005', NULL, 'Row p005'
>'    001', NULL, 'Row p001'
>'    001', NULL, 'Row p001-3'
>-------------------------------------------
>Qry 1: 
>      select pn from pmtbl group by pn;
>-------------------------------------------
>    001
>    004
>    005
 
  This is correct.  The group by in this case acts just like
  a query "select distinct pn from pmtbl" which by the way
  may actually execute faster 

>Qry 2: 
>      select pn, txt from pmtbl group by pn;
>------------------------------------------------
>    001, Row p001
>    001, Row p001
>    001, Row p001-3
>    004, Row p004
>    005, Row p005
 
  This is, in most systems, syntactically incorrect.  If it is
  permitted, then the group by clause must be ignored  (or it
  is erroneously being used solely as a replacement for order by). 
  
  According to ANSI SQL, the use of a group by clause requires
  that all columns used in the select clause must be
  single-valued... either columns in the GROUP BY, constants or
  aggregates (i.e., SUM, MAX, MIN, COUNT, etc).  This is 
  because the purpose of the group by clause is to collapse
  all rows in the query that are common to a set of group by
  clause columns into a single row.  If all you are
  trying to do here is sort the columns, use ORDER BY, not
  GROUP BY.

>Qry 3: 
>      select pn, txt from pmtbl group by pn, txt;
>------------------------------------------------
>    001, Row p001
>    001, Row p001
>    001, Row p001-3
>    004, Row p004
>    005, Row p005
 
  This example is consistent with Qry 1 since there are no
  columns that are in the SELECT that are not in the
  GROUP BY.  Again, unnecessary and SELECT DISTINCT would
  work as well (and probably faster).

>QUESTIONS:
>==========
>q1) Why do Qry1 and Qry2 have different results?
 
  Explained above.  The group by only works if you have
  aggregates and it probably should be flagged as a syntax error.

>q2) If in Qry 3, selected only pn column, should the result be the same?
 
  No.  The results would be 001, 001, 001, 004, 005... I.e., the pn
  column would be the same, the txt column would be dropped.

>q3) Does the clause,'group by Col1, Col2' cause the query to retrieve only
>    the unique row of (Col1 + Col2) ?
 
  GROUP BY, as noted before, causes each set of values to be considered
  as a group.  ANSI states that all columns in the SELECT must then be
  "single-valued" with respect to the group.   See above for an explanation.
  
>q4) Comparing these two queries:
>     select <col1>, <col2> from <tbl1> where ( <conditions C1> )
>            group by <col3>
>
>    and
> 
>     select <col1>, <col2> from <tbl1> 
>            group by <col3>
>            having ( <conditions C1> )
>

  This is truly confusing.  The conditions in the first example are executed
  against each single row of the table.  The conditions in the second example
  with the HAVING are executed against the rows in the grouped result table
  after the group by clause has been applied.  Usually, HAVING is only used
  to compare aggregates:

      SELECT REGION, SUM(SALES) FROM TBL
      GROUP BY REGION
      HAVING SUM(SALES)>10000

  Note that this this is ENTIRELY different from:

      SELECT REGION, SUM(SALES) FROM TBL
      WHERE SALES > 10000
      GROUP BY REGION
 
  The first query gives you a list of regions (and their total sales)
  where the total sales is more than 10000.   The second query gives
  you a list of regions with (lets say) offices (the individual rows)
  having sales greater than 10000 and the total sales of those offices.
 
  I believe that your confusion is caused by the use of GROUP BY to
  do sorting, not grouping.  ORDER BY merely sorts the rows before
  returning them. GROUP BY does much more.
 
  PS, in some systems, GROUP BY without an ORDER BY may NOT actually
  order the rows the way you expect.  In most systems, the GROUP BY
  is done after actually executing a sort, but on some systems  
  (notably Teradata's parallel processing data base computer),
  GROUP BYs in some case may be executed in parallel and then
  the resulting grouped rows merged together in random order in
  the abscence of an ORDER BY.  Note that this is not at all a
  violation of the SQL definition.  GROUP BY is NEVER specified to
  cause actually sorting, even though the need to group usually
  engenders the need to sort and then, once the rows are sorted,
  it makes no particular purpose to intentionally randomize them 
  before returning them.  So most single-processer DBMSs actually
  do (as an artifact) cause GROUP BYs to appear sorted.  
 
  I hope this has helped.  If not, please post again or e-mail
  and I will try to give you furhter info. 
 
Jon Rosen

hansene@hpcc01.HP.COM (Ed Hansen) (05/17/91)

I tried executing your queries in Ingres.

Query 1 works as you have indicated; 
Query 2 gives an error saying that 
' The columns in the SELECT clause must be contained in the
    GROUP BY clause.';
Query 3 has four rows in the output;
    +--------+-------------------------+
    |pn      |txt                      |
    +--------+-------------------------+
    |     001|Row p0001-3              |
    |     001|Row p001                 |
    |     004|Row p004                 |
    |     005|Row p005                 |
    +--------+-------------------------+
    (4 rows)

It seems that what you have is a different implementation
and it appears that your implementation gives some
unusual results. 
I wonder what the ANSI standard prescribes.
JeanMarie(jeanmar@hpcc31.corp.hp.com)