SQL Server: Query Efficiency Having vs Subquery

Posted on

Question :

I have the following query where I used a sub query since I can’t use aggregates in the where clause.

SSMS suggested I use the HAVING clause but then I would need to group by all the other columns (right?) which I did not want to have to do so I ended up using the sub query.

Is using the sub query the most efficient way or should I use the Having clause so I don’t have to read from the same table twice?

Is the difference negligible?

select *
from [DW_STAGE].[ETL].[Claims]
where pd_dt > (
        select dateadd(m, - 12, cast(max(pd_dt) as date))
        from [DW_STAGE].[ETL].[Claims]
        )

Answer :

If your table’s clustered index has pd_dt as the first column, this may be the most optimal query. You could simplify it by breaking it into two parts, like so:

DECLARE @dt date;
SELECT @dt=DATEADD(month, -12, CAST(MAX(pd_dt) AS date)) FROM DW_STAGE.ETL.Claims;

SELECT * FROM DW_STAGE.ETL.Claims WHERE pd_dt>@dt;

Things you could do to improve performance:

  • Reduce the number of columns from * to those specific columns that you need.
  • Create a covering, non-clustered index for these columns, where the first column of the index is pd_dt or pd_dt DESC.

I fail to see how using HAVING would let you avoid reading from the same table twice. The only way I can see HAVING could be used here is in combination with a self-join – more specifically, self-cross join. That already sounds… not very promising. And yes, you are right, you would need to put all the output columns (which in your case would be all DW_STAGE.ETL.Claims columns) in the GROUP BY. It would probably look along the lines of this:

SELECT
  t1.*
FROM
  DW_STAGE.ETL.Claims AS t1
  CROSS JOIN
  DW_STAGE.ETL.Claims AS t2
GROUP BY
  t1.column1,
  t1.column2,
  ... /* and so on to include all t1 columns */
HAVING
  t1.pd_st > DATEADD(MONTH, -12, CAST(max(pd_dt) AS date));

Obviously, the table is touched twice here – just as in your posted query, except this query is much uglier than yours.

Nevertheless, it is possible to accomplish what you want without referencing the table twice. You can use window aggregation and a derived table for that:

SELECT
  *
FROM
  (
    SELECT
      *,
      max_pd_dt = MAX(pd_dt) OVER ()
    FROM
      DW_STAGE.ETL.Claims
  ) AS s
WHERE
  pd_dt > DATEADD(MONTH, -12, CAST(max_pd_dt AS date))
;

Or your can substitute a CTE for the derived table:

WITH cte AS
  (
    SELECT
      *,
      max_pd_dt = MAX(pd_dt) OVER ()
    FROM
      DW_STAGE.ETL.Claims
  ) AS s
SELECT
  *
FROM
  cte
WHERE
  pd_dt > DATEADD(MONTH, -12, CAST(max_pd_dt AS date))
;

Note that the * in the outer query includes the max_pd_dt column returned by the nested SELECT. To exclude it from the output, you will have to explicitly specify only the DW_STAGE.ETL.Claims columns. That would actually be a good time to heed Daniel Hutmacher’s advice to limit your output to just the columns you really need. (And I agree with his other suggestion of creating a covering index.)

Leave a Reply

Your email address will not be published. Required fields are marked *