Question :
I have the following query where I used a sub query since I can’t use aggregates in the where clause.
SSMS
suggested I use the HAVING
clause but then I would need to group by all the other columns (right?) which I did not want to have to do so I ended up using the sub query.
Is using the sub query the most efficient way or should I use the Having clause so I don’t have to read from the same table twice?
Is the difference negligible?
select *
from [DW_STAGE].[ETL].[Claims]
where pd_dt > (
select dateadd(m, - 12, cast(max(pd_dt) as date))
from [DW_STAGE].[ETL].[Claims]
)
Answer :
If your table’s clustered index has pd_dt
as the first column, this may be the most optimal query. You could simplify it by breaking it into two parts, like so:
DECLARE @dt date;
SELECT @dt=DATEADD(month, -12, CAST(MAX(pd_dt) AS date)) FROM DW_STAGE.ETL.Claims;
SELECT * FROM DW_STAGE.ETL.Claims WHERE pd_dt>@dt;
Things you could do to improve performance:
- Reduce the number of columns from
*
to those specific columns that you need. - Create a covering, non-clustered index for these columns, where the first column of the index is
pd_dt
orpd_dt DESC
.
I fail to see how using HAVING would let you avoid reading from the same table twice. The only way I can see HAVING could be used here is in combination with a self-join – more specifically, self-cross join. That already sounds… not very promising. And yes, you are right, you would need to put all the output columns (which in your case would be all DW_STAGE.ETL.Claims
columns) in the GROUP BY. It would probably look along the lines of this:
SELECT
t1.*
FROM
DW_STAGE.ETL.Claims AS t1
CROSS JOIN
DW_STAGE.ETL.Claims AS t2
GROUP BY
t1.column1,
t1.column2,
... /* and so on to include all t1 columns */
HAVING
t1.pd_st > DATEADD(MONTH, -12, CAST(max(pd_dt) AS date));
Obviously, the table is touched twice here – just as in your posted query, except this query is much uglier than yours.
Nevertheless, it is possible to accomplish what you want without referencing the table twice. You can use window aggregation and a derived table for that:
SELECT
*
FROM
(
SELECT
*,
max_pd_dt = MAX(pd_dt) OVER ()
FROM
DW_STAGE.ETL.Claims
) AS s
WHERE
pd_dt > DATEADD(MONTH, -12, CAST(max_pd_dt AS date))
;
Or your can substitute a CTE for the derived table:
WITH cte AS
(
SELECT
*,
max_pd_dt = MAX(pd_dt) OVER ()
FROM
DW_STAGE.ETL.Claims
) AS s
SELECT
*
FROM
cte
WHERE
pd_dt > DATEADD(MONTH, -12, CAST(max_pd_dt AS date))
;
Note that the *
in the outer query includes the max_pd_dt
column returned by the nested SELECT. To exclude it from the output, you will have to explicitly specify only the DW_STAGE.ETL.Claims
columns. That would actually be a good time to heed Daniel Hutmacher’s advice to limit your output to just the columns you really need. (And I agree with his other suggestion of creating a covering index.)