Question :
First of all, I can describe the schema of the table, but my table itself is around 80GB so I am hoping some expert eyes can look at any obvious optimizations that can be performed. I am carrying out some activity-based analysis to understand a set of users from an in-house virtual gaming environment.
I have two base tables:
EventTable: Moderately big at 10GB
Activity_ID UserName ActivityStart ActivityEnd Badge
ActivityTable: Super big at 100GB
UserName ActivityTime Game1Points Game2Points
In short, I am trying the following:
- For each activity in the EventTable:
- Compute the median of Game1Points, Game2Points
for two cases: for all entries in the ActivityTable
that were present since 1 hour before the activity
and for all entries stored during the activity itself
Therefore, the final table I am expecting is this:
ResultTable:
Activity_ID Badge Game1_Before Game1_During Game2_Before Game2_During
This clearly looks like a job for CURSOR
s so I first wrote my query using that approach. It was running and I could see the progress but after reading so many horror stories, I decided to re-write my 200-line CURSOR-based approach with a set-based approach. This works for small tables perfectly. However, for tables of my size, it seems to be stalling (or at least I am not sure how long I should run it). I ran the Display Estimated Plan
and added the relevant indices as well. Now my problem is that I am not even sure how long this query will take to run (unlike the CURSOR
approach where I could print some messages and see the progress).
Can someone tell me if there is a better way to do this? My current query is given below. While the query itself looks long, the structure is quite simple: First, I define a CTE that gets all the points associated with a particular event. Using this CTE, I calculate two separate medians.
WITH ST AS
(
-- Fetching the required data: Get entries that exist
-- since 1 hour before the activity until the activity ended
SELECT ROW_NUMBER() OVER(ORDER BY Activity_ID) AS ID
, Activity_ID
, Badge
, (CASE WHEN
Y.ActivityTime BETWEEN DATEADD(HOUR, -1, dbo.ROUNDTIME(X.ActivityStart) AND dbo.ROUNDTIME(X.ActivityStart)
THEN 'Game1ActivityBefore'
WHEN
Y.ActivityTime BETWEEN dbo.ROUNDTIME(X.ActivityStart) AND dbo.ROUNDTIME(X.ActivityFinish)
THEN 'Game1ActivityDuring'
END) Game1Type
, (CASE WHEN
Y.ActivityTime BETWEEN DATEADD(HOUR, -1, dbo.ROUNDTIME(X.ActivityStart)) AND dbo.ROUNDTIME(X.ActivityStart)
THEN 'Game2ActivityBefore'
WHEN
Y.ActivityTime BETWEEN dbo.ROUNDTIME(X.ActivityStart) AND dbo.ROUNDTIME(X.ActivityFinish)
THEN 'Game2ActivityDuring'
END) Game2Type
, CAST(Game1Points AS bigint) Game1Points
, CAST(Game2Poins AS bigint) Game2Points
FROM
dbo.EventTable X
INNER JOIN dbo.ActivityTable Y
ON X.UserName = Y.UserName
WHERE
ActivityTime BETWEEN DATEADD(HOUR, -1, dbo.ROUNDTIME(X.ActivityStart))
AND dbo.ROUNDTIME(X.ActivityFinish)
AND
(
Badge LIKE 'GREEN%'
OR Badge LIKE 'RED%'
)
)
-- Eliminating any NULL values that resulted from the median computation step
SELECT Activity_ID
, Badge
, MAX(CAST([Game1ActivityBefore] AS DECIMAL)) AS 'Game1_Before'
, MAX(CAST([Game1ActivityDuring] AS DECIMAL)) AS 'Game1_During'
, MAX(CAST([Game2ActivityBefore] AS DECIMAL)) AS 'Game2_Before'
, MAX(CAST([Game2ActivityDuring] AS DECIMAL)) AS 'Game2_During'
FROM
(
-- Median computation for the two columns - Aggregation Step
SELECT Activity_ID
, Badge
, Game1Type
, Game2Type
, AVG(Game1Points) Game1Median
, AVG(Game2Points) Game2Median
FROM
(
-- Median computation for the two columns - Inner step
SELECT Activity_ID
, Badge
, Game1Type
, Game2Type
, Game1Points
, ROW_NUMBER() OVER (
PARTITION BY Activity_ID, Badge, Game1Type
ORDER BY Game1Points ASC, ID ASC) AS Game1RowAsc
, ROW_NUMBER() OVER (
PARTITION BY Activity_ID, Badge, Game1Type
ORDER BY Game1Points DESC, ID DESC) AS Game1RowDesc
, Game2Points
, ROW_NUMBER() OVER (
PARTITION BY Activity_ID, Badge, Game2Type
ORDER BY Game2Points ASC, ID ASC) AS Game2RowAsc
, ROW_NUMBER() OVER (
PARTITION BY Activity_ID, Badge, Game2Type
ORDER BY Game2Points DESC, ID DESC) AS Game2RowDesc
FROM ST TS
) X
WHERE Game1RowAsc IN (Game1RowDesc, Game1RowDesc - 1, Game1RowDesc + 1)
OR Game2RowAsc IN (Game2RowDesc, Game2RowDesc - 1, Game2RowDesc + 1)
GROUP BY Activity_ID, Badge, Game1Type, Game2Type
) Y
PIVOT
(
MAX(Game1Median)
FOR Game1Type IN ([Game1ActivityBefore], [Game1ActivityDuring])
) AS PivotTable
PIVOT
(
MAX(Game2Median)
FOR Game2Type IN ([Game2ActivityBefore], [Game2ActivityDuring])
) AS PivotTable2
GROUP BY Activity_ID, Badge
UPDATE: After the suggestion. Take 1
CREATE TABLE #ST(ID INT
, Activity_ID INT
, Game1Type CHAR(13)
, Game2Type CHAR(13)
, Game1Points DECIMAL
, Game2Points DECIMAL
)
INSERT INTO #ST
SELECT ROW_NUMBER() OVER(ORDER BY Activity_ID) AS ID
, Activity_ID
, Badge
, (CASE WHEN
Y.ActivityTime BETWEEN DATEADD(HOUR, -1, dbo.ROUNDTIME(X.ActivityStart) AND dbo.ROUNDTIME(X.ActivityStart)
THEN 'Game1ActivityBefore'
WHEN
Y.ActivityTime BETWEEN dbo.ROUNDTIME(X.ActivityStart) AND dbo.ROUNDTIME(X.ActivityFinish)
THEN 'Game1ActivityDuring'
END) Game1Type
, (CASE WHEN
Y.ActivityTime BETWEEN DATEADD(HOUR, -1, dbo.ROUNDTIME(X.ActivityStart)) AND dbo.ROUNDTIME(X.ActivityStart)
THEN 'Game2ActivityBefore'
WHEN
Y.ActivityTime BETWEEN dbo.ROUNDTIME(X.ActivityStart) AND dbo.ROUNDTIME(X.ActivityFinish)
THEN 'Game2ActivityDuring'
END) Game2Type
, CAST(Game1Points AS bigint) Game1Points
, CAST(Game2Poins AS bigint) Game2Points
FROM
dbo.EventTable X
INNER JOIN dbo.ActivityTable Y
ON X.UserName = Y.UserName
WHERE
ActivityTime BETWEEN DATEADD(HOUR, -1, dbo.ROUNDTIME(X.ActivityStart))
AND dbo.ROUNDTIME(X.ActivityFinish)
AND
(
Badge LIKE 'GREEN%'
OR Badge LIKE 'RED%'
)
CREATE TABLE #INTERMEDIATE (Activity_ID INT
, Badge VARCHAR(255)
, Game1Type CHAR(13)
, Game2Type CHAR(13)
, Game1Points DECIMAL
, Game1RowAsc INT
, Game1RowDesc INT
, Game2Points DECIMAL
, Game2RowAsc INT
, Game2RowDesc INT
)
INSERT INTO #INTERMEDIATE
SELECT Activity_ID
, Badge
, Game1Type
, Game2Type
, Game1Points
, ROW_NUMBER() OVER (
PARTITION BY Activity_ID, Badge, Game1Type
ORDER BY Game1Points ASC, ID ASC) AS Game1RowAsc
, ROW_NUMBER() OVER (
PARTITION BY Activity_ID, Badge, Game1Type
ORDER BY Game1Points DESC, ID DESC) AS Game1RowDesc
, Game2Points
, ROW_NUMBER() OVER (
PARTITION BY Activity_ID, Badge, Game2Type
ORDER BY Game2Points ASC, ID ASC) AS Game2RowAsc
, ROW_NUMBER() OVER (
PARTITION BY Activity_ID, Badge, Game2Type
ORDER BY Game2Points DESC, ID DESC) AS Game2RowDesc
FROM #ST
CREATE CLUSTERED INDEX [TT1] ON #INTERMEDIATE (Acitivity_ID
, Badge)
CREATE NONCLUSTERED INDEX [TT2] ON #INTERMEDIATE (Game1RowAsc
, Game1RowDesc
, Game2RowAsc
, Game2RowDesc)
SELECT Activity_ID
, Badge
, MAX(CAST([Game1ActivityBefore] AS DECIMAL)) AS 'Game1_Before'
, MAX(CAST([Game1ActivityDuring] AS DECIMAL)) AS 'Game1_During'
, MAX(CAST([Game2ActivityBefore] AS DECIMAL)) AS 'Game2_Before'
, MAX(CAST([Game2ActivityDuring] AS DECIMAL)) AS 'Game2_During'
FROM
(
SELECT Activity_ID
, Badge
, Game1Type
, Game2Type
, AVG(Game1Points) Game1Median
, AVG(Game2Points) Game2Median
FROM #ST
WHERE Game1RowAsc IN (Game1RowDesc, Game1RowDesc - 1, Game1RowDesc + 1)
OR Game2RowAsc IN (Game2RowDesc, Game2RowDesc - 1, Game2RowDesc + 1)
GROUP BY Activity_ID, Badge, Game1Type, Game2Type
) Y
PIVOT
(
MAX(Game1Median)
FOR Game1Type IN ([Game1ActivityBefore], [Game1ActivityDuring])
) AS PivotTable
PIVOT
(
MAX(Game2Median)
FOR Game2Type IN ([Game2ActivityBefore], [Game2ActivityDuring])
) AS PivotTable2
GROUP BY Activity_ID, Badge
DROP TABLE #INTERMEDIATE
DROP TABLE #ST
UPDATE 2:: Execution Plan
May be the Display Execution Plan
option does not show the indices being used?
Answer :
1 – Don’t use CTEs for performance. CTEs are basically single-use views. They convey absolutely NO performance benefit. If performance is important and you have enough data, indexed #temp
tables will probably work a lot better.
2- Definitely use a set-based approach. I can’t think of any scenario where a CURSOR
would be faster than a set-based query.
I’m not going to completely rewrite your query, but I do see some things that should make a huge difference:
- Put your initial selection (outermost CTE) AND your subquery into a
#temp
table and index it. Right now your filters are created dynamically and not seekable!
You are filtering on the ROW_NUMBER
output, 4 times, so that’s a table scan over every row.