Efficient query to get last row group by multiple columns

Posted on

Question :

I have a table like the following:

spread_id serial NOT NULL,
game_id integer NOT NULL,
sportsbook_id integer NOT NULL,
spread_type integer NOT NULL,
spread_duration integer NOT NULL,
home_line double precision,
home_odds integer,
away_line double precision,
away_odds integer,
update_time timestamp without time zone NOT NULL,
game_update_count integer NOT NULL

I’m trying to get the last row inserted (max update_time or game_update_count), for each group of sportsbook_id, spread_type, spread_duration, and game_id.

The following query gets me close, but I am not able to select the lines/odds without Postgres complaining.

SELECT 
     spreads.game_id, sportsbook_id, spread_type, spread_duration,
     MAX(game_update_count) AS game_update_count 
FROM spreads 
LEFT JOIN schedule ON 
        schedule.game_id = spreads.game_id 

WHERE date >= '2012-01-01' AND date <= '2012-01-02' 
GROUP BY 
     spreads.game_id, sportsbook_id, spread_type, spread_duration 
ORDER BY 
    spread_duration, spread_type, sportsbook_id, spreads.game_id,
    game_update_count DESC;

Anyone have any thoughts on a better approach to take?

Answer :

The simplest way in Postgres is with DISTINCT ON:

SELECT DISTINCT ON (1,2,3,4)
       sp.game_id, sportsbook_id, spread_type, spread_duration, game_update_count
FROM   spreads sp
LEFT   JOIN schedule sch USING (game_id)
WHERE  date >= '2012-01-01'
AND    date <= '2012-01-02'
ORDER  BY 4,3,2,1, game_update_count DESC;

Details:

The numbers are just syntax shorthand referring to the ordinal position of SELECT items.

If game_update_count can be NULL, you’ll want game_update_count DESC NULLS LAST.

Instead of using a group by, have you tried using a Window Function?

Try this:

SELECT * 
FROM (
    SELECT 
       spreads.game_id, sportsbook_id, spread_type, spread_duration,
       row_number() over (partition by spreads.game_id, sportsbook_id, spread_type, spread_duration order by update_time desc) priority
    FROM spreads 
    LEFT JOIN schedule ON 
        schedule.game_id = spreads.game_id 
    WHERE date >= '2012-01-01' AND date <= '2012-01-02'
     )
WHERE priority = 1

The row_number() which I’ve aliased as priority is the window function and what it is doing is conceptually similar to a GROUP BY clause. The difference is that it is just allowing you to still see the data with row-level fidelity through the window. In this case, you’ll notice that the columns you grouped by have been used to partition the data. The outer SELECT statement eliminates the data that you don’t want (all of the out of date sport-bet lines).

I wish you the best as you implement Window Functions (and with your sport betting)!

(word of warning… I typically work with SQL Server guy, so my code may not by syntactically correct, but it should get you off to a good start)

Leave a Reply

Your email address will not be published. Required fields are marked *