Generate multiple running totals with GROUP BY day

Posted on

Question :

I have a set of transactions of stock purchases by users and I want to keep track of a running balance of each stock as the year progresses. I am using a windowing function to track the running balance but for some reason I cannot get the GROUP BY portion of this query to work.

It continues to have duplicate days in the result set even when I attempt to group by the date (created_at). Sample below:

  sum(case when t.stock_ticker = 'tsla' then t.amount end) over (order by t.created_at::date rows unbounded preceding) as tsla_running_amount,
  sum(case when t.stock_ticker = 'goog' then t.amount end) over (order by t.created_at::date rows unbounded preceding) as goog_running_amount,
from transactions t
group by t.created_at, t.customer_id, t.stock_ticker, t.amount
order by t.created_at desc;

Test setup:

CREATE TABLE transactions (
   transaction_id varchar(255) NOT NULL,
   amount float8 NOT NULL,
   stock_ticker varchar(255) NOT NULL,
   transaction_type varchar(255) NOT NULL,
   customer_id varchar NOT NULL,
   inserted_at timestamp NOT NULL,
   created_at timestamp NOT NULL,
   CONSTRAINT transactions_pkey PRIMARY KEY (transaction_id)

INSERT INTO transactions(transaction_id, amount, stock_ticker, transaction_type, customer_id, inserted_at, created_at)
  ('123123abmk12', 10, 'tsla', 'purchase', 'a1b2c3', '2020-04-01 01:00:00', '2020-04-01 01:00:00')
, ('123123abmk13', 20, 'tsla', 'purchase', 'a1b2c3', '2020-04-03 01:00:00', '2020-04-03 01:00:00')
, ('123123abmk14',  5, 'goog', 'purchase', 'a1b2c3', '2020-04-01 01:00:00', '2020-04-01 01:00:00')
, ('123123abmk15',  8, 'goog', 'purchase', 'a1b2c3', '2020-04-03 01:00:00', '2020-04-03 01:00:00');

CREATE INDEX ix_transactions_customer_id ON transactions USING btree (customer_id);

The result here always comes back with multiple rows per day, when I want them to be grouped all into one day.

After doing some research I attempted to cast created_at to date in the GROUP BY clause as well, but I get this error:

Column t.created_at must appear in the GROUP BY clause or be used in an aggregate function

In addition, the results are only going to show days in which a transaction has happened for a user. I need to be able to show a row for each day in a time series (1 year) even if the user did not make a transaction that day. (Using the most recent running balance on the row instead.)

I think that generate_series() is the way to go, but I am having trouble understanding how to fit it in.

Answer :

I see a couple of issues. This should do it:

SELECT * -- ⑥
FROM   (  -- ①
   SELECT the_day::date
   FROM   generate_series(timestamp '2020-01-01', date_trunc('day', localtimestamp), interval '1 day') the_day
   ) d 
LEFT   JOIN ( -- ②
   SELECT customer_id
        , created_at::date AS the_day -- ⑥
        , sum(sum(t.amount) FILTER (WHERE stock_ticker = 'tsla')) OVER w AS tsla_running_amount -- ③
        , sum(sum(t.amount) FILTER (WHERE stock_ticker = 'goog')) OVER w AS goog_running_amount
   FROM   transactions t
   WHERE  created_at >= timestamp '2020-01-01'  -- ④
   GROUP  BY customer_id, created_at::date  -- ⑤
   WINDOW w AS (PARTITION BY customer_id ORDER BY created_at::date) -- ③
   ) t USING (the_day) -- ⑥
ORDER  BY customer_id, the_day; -- ⑦

db<>fiddle here

① Generating all days of the current year in optimized fashion. See:

② The LEFT JOIN you had in mind. You may really want one row per combination of (customer_id, day). See:

③ While you also GROUP BY customer_id the Window frame needs to lead with PARTITION BY customer_id.

Drop rows unbounded preceding. The default window frame should be just fine. The manual:

The default framing option is RANGE UNBOUNDED PRECEDING, which is the

Use the aggregate FILTER clause instead of the CASE expression. Shorter, faster, cleaner. See:

Finally wrap the result of the aggregation in another sum() for the window function. See:

The separate WINDOW clause avoids spelling out the same repeatedly. No effect on performance.

④ The WHERE is logically redundant, but assuming there are older rows, too, it excludes irrelevant rows early, improving performance.
Asuming no or almost no future timestamps, so no upper bound.

⑤ You cannot have t.stock_ticker, t.amount in GROUP BY as you want one row per (customer_id, the_day).

⑥ Use the same column alias (the_day in my example) to allow the simple USING (the_day) in the JOIN clause, and SELECT * at the top level.

⑦ Not sure about the final sort order. This seems more useful to me. Adjust to your preference.

Leave a Reply

Your email address will not be published. Required fields are marked *