Question :
I have a set of transactions of stock purchases by users and I want to keep track of a running balance of each stock as the year progresses. I am using a windowing function to track the running balance but for some reason I cannot get the GROUP BY
portion of this query to work.
It continues to have duplicate days in the result set even when I attempt to group by the date (created_at
). Sample below:
select
t.customer_id,
t.created_at::date,
sum(case when t.stock_ticker = 'tsla' then t.amount end) over (order by t.created_at::date rows unbounded preceding) as tsla_running_amount,
sum(case when t.stock_ticker = 'goog' then t.amount end) over (order by t.created_at::date rows unbounded preceding) as goog_running_amount,
from transactions t
group by t.created_at, t.customer_id, t.stock_ticker, t.amount
order by t.created_at desc;
Test setup:
CREATE TABLE transactions (
transaction_id varchar(255) NOT NULL,
amount float8 NOT NULL,
stock_ticker varchar(255) NOT NULL,
transaction_type varchar(255) NOT NULL,
customer_id varchar NOT NULL,
inserted_at timestamp NOT NULL,
created_at timestamp NOT NULL,
CONSTRAINT transactions_pkey PRIMARY KEY (transaction_id)
);
INSERT INTO transactions(transaction_id, amount, stock_ticker, transaction_type, customer_id, inserted_at, created_at)
VALUES
('123123abmk12', 10, 'tsla', 'purchase', 'a1b2c3', '2020-04-01 01:00:00', '2020-04-01 01:00:00')
, ('123123abmk13', 20, 'tsla', 'purchase', 'a1b2c3', '2020-04-03 01:00:00', '2020-04-03 01:00:00')
, ('123123abmk14', 5, 'goog', 'purchase', 'a1b2c3', '2020-04-01 01:00:00', '2020-04-01 01:00:00')
, ('123123abmk15', 8, 'goog', 'purchase', 'a1b2c3', '2020-04-03 01:00:00', '2020-04-03 01:00:00');
CREATE INDEX ix_transactions_customer_id ON transactions USING btree (customer_id);
The result here always comes back with multiple rows per day, when I want them to be grouped all into one day.
After doing some research I attempted to cast created_at
to date
in the GROUP BY
clause as well, but I get this error:
Column t.created_at must appear in the GROUP BY clause or be used in an aggregate function
In addition, the results are only going to show days in which a transaction has happened for a user. I need to be able to show a row for each day in a time series (1 year) even if the user did not make a transaction that day. (Using the most recent running balance on the row instead.)
I think that generate_series()
is the way to go, but I am having trouble understanding how to fit it in.
Answer :
I see a couple of issues. This should do it:
SELECT * -- ⑥
FROM ( -- ①
SELECT the_day::date
FROM generate_series(timestamp '2020-01-01', date_trunc('day', localtimestamp), interval '1 day') the_day
) d
LEFT JOIN ( -- ②
SELECT customer_id
, created_at::date AS the_day -- ⑥
, sum(sum(t.amount) FILTER (WHERE stock_ticker = 'tsla')) OVER w AS tsla_running_amount -- ③
, sum(sum(t.amount) FILTER (WHERE stock_ticker = 'goog')) OVER w AS goog_running_amount
FROM transactions t
WHERE created_at >= timestamp '2020-01-01' -- ④
GROUP BY customer_id, created_at::date -- ⑤
WINDOW w AS (PARTITION BY customer_id ORDER BY created_at::date) -- ③
) t USING (the_day) -- ⑥
ORDER BY customer_id, the_day; -- ⑦
db<>fiddle here
① Generating all days of the current year in optimized fashion. See:
② The LEFT JOIN
you had in mind. You may really want one row per combination of (customer_id, day)
. See:
③ While you also GROUP BY customer_id
the Window frame needs to lead with PARTITION BY customer_id
.
Drop rows unbounded preceding
. The default window frame should be just fine. The manual:
The default framing option is
RANGE UNBOUNDED PRECEDING
, which is the
same asRANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
.
Use the aggregate FILTER
clause instead of the CASE
expression. Shorter, faster, cleaner. See:
Finally wrap the result of the aggregation in another sum()
for the window function. See:
The separate WINDOW
clause avoids spelling out the same repeatedly. No effect on performance.
④ The WHERE
is logically redundant, but assuming there are older rows, too, it excludes irrelevant rows early, improving performance.
Asuming no or almost no future timestamps, so no upper bound.
⑤ You cannot have t.stock_ticker, t.amount
in GROUP BY
as you want one row per (customer_id, the_day)
.
⑥ Use the same column alias (the_day
in my example) to allow the simple USING (the_day)
in the JOIN
clause, and SELECT *
at the top level.
⑦ Not sure about the final sort order. This seems more useful to me. Adjust to your preference.