I have the following query where I used a sub query since I can’t use aggregates in the where clause.
SSMS suggested I use the
HAVING clause but then I would need to group by all the other columns (right?) which I did not want to have to do so I ended up using the sub query.
Is using the sub query the most efficient way or should I use the Having clause so I don’t have to read from the same table twice?
Is the difference negligible?
select * from [DW_STAGE].[ETL].[Claims] where pd_dt > ( select dateadd(m, - 12, cast(max(pd_dt) as date)) from [DW_STAGE].[ETL].[Claims] )
If your table’s clustered index has
pd_dt as the first column, this may be the most optimal query. You could simplify it by breaking it into two parts, like so:
DECLARE @dt date; SELECT @dt=DATEADD(month, -12, CAST(MAX(pd_dt) AS date)) FROM DW_STAGE.ETL.Claims; SELECT * FROM DW_STAGE.ETL.Claims WHERE pd_dt>@dt;
Things you could do to improve performance:
- Reduce the number of columns from
*to those specific columns that you need.
- Create a covering, non-clustered index for these columns, where the first column of the index is
I fail to see how using HAVING would let you avoid reading from the same table twice. The only way I can see HAVING could be used here is in combination with a self-join – more specifically, self-cross join. That already sounds… not very promising. And yes, you are right, you would need to put all the output columns (which in your case would be all
DW_STAGE.ETL.Claims columns) in the GROUP BY. It would probably look along the lines of this:
SELECT t1.* FROM DW_STAGE.ETL.Claims AS t1 CROSS JOIN DW_STAGE.ETL.Claims AS t2 GROUP BY t1.column1, t1.column2, ... /* and so on to include all t1 columns */ HAVING t1.pd_st > DATEADD(MONTH, -12, CAST(max(pd_dt) AS date));
Obviously, the table is touched twice here – just as in your posted query, except this query is much uglier than yours.
Nevertheless, it is possible to accomplish what you want without referencing the table twice. You can use window aggregation and a derived table for that:
SELECT * FROM ( SELECT *, max_pd_dt = MAX(pd_dt) OVER () FROM DW_STAGE.ETL.Claims ) AS s WHERE pd_dt > DATEADD(MONTH, -12, CAST(max_pd_dt AS date)) ;
Or your can substitute a CTE for the derived table:
WITH cte AS ( SELECT *, max_pd_dt = MAX(pd_dt) OVER () FROM DW_STAGE.ETL.Claims ) AS s SELECT * FROM cte WHERE pd_dt > DATEADD(MONTH, -12, CAST(max_pd_dt AS date)) ;
Note that the
* in the outer query includes the
max_pd_dt column returned by the nested SELECT. To exclude it from the output, you will have to explicitly specify only the
DW_STAGE.ETL.Claims columns. That would actually be a good time to heed Daniel Hutmacher’s advice to limit your output to just the columns you really need. (And I agree with his other suggestion of creating a covering index.)