Question :
There are two tables:
Table 1: data_mobile(user_id, page_id)
user_id | page_id |
---|---|
1 | 2 |
1 | 3 |
2 | 2 |
3 | 4 |
Table 2: data_web(user_id, webpage_id)
user_id | webpage_id |
---|---|
1 | 5 |
4 | 3 |
5 | 2 |
3 | 4 |
Problem: Calculate the percentage of users who only use mobile, only website, and both. All figures add up should be 1. It’s better to have all figures showed in the same table.
For example, the expected table can be:
only_mobile | only_web | both |
---|---|---|
0.2 | 0.4 | 0.4 |
OR
Type | Percentage |
---|---|
only_mobile | 0.2 |
only_web | 0.4 |
both | 0.4 |
Wondering if there is any efficient ways without creating views and can be in the same sql statement if possible. Please let me know if anything is unclear. Appreciated.
Answer :
There is no need for any complicated set operations.
You just need to full-join the two tables together, with the added complication of doing that only over distinct user_id
.
Then, a straight-forward conditional aggregation looks like it should work
SELECT
COUNT(*) FILTER (WHERE w.user_id IS NULL) * 1.0 / COUNT(*) AS only_mobile,
COUNT(*) FILTER (WHERE m.user_id IS NULL) * 1.0 / COUNT(*) AS only_web,
COUNT(*) FILTER (WHERE m.user_id IS NOT NULL AND w.user_id IS NOT NULL) * 1.0 / COUNT(*) AS both
FROM (SELECT DISTINCT user_id FROM data_mobile) m
FULL JOIN (SELECT DISTINCT user_id FROM data_web) w ON w.user_id = m.user_id
The FILTER
clause is specific to Postgres, you can simulate it in other DBMSs using COUNT(CASE WHEN...
This is an excellent demonstration of the set operations available in SQL.
References
- https://www.postgresql.org/docs/current/queries-union.html
- https://www.postgresql.org/docs/current/sql-select.html#SQL-UNION
Set Operators
SQL provides set operators that manipulates two sets (rows) of data in useful ways:
UNION
combines two sets. The columns must be the same type. There are two variants, justUNION
removes duplicates whileUNION ALL
does not.INTERSECT
takes rows that appear in both sets.EXCEPT
takes rows that appear in the first set and not the second.
Sets
The sets we need to build are:
- A set that includes
user_id
s that only appear indata_mobile
. - A set that includes
user_id
s that only appear indata_web
. - A set that includes
user_id
s that appear in both tables. - A set that includes all
user_id
s.
The first three are the data columns that were requested, the fourth is used to calculate the percentages. The sets we need to build map nicely to the available operations:
web_only
->data_web EXCEPT data_mobile
mobile_only
->data_mobile EXCEPT data_web
both
->data_web INTERSECT data_mobile
all
->data_web UNION data_mobile
Query
To demonstrate the various set operators in use, I used CTEs for each set so it is clear which operation is used to build which set.
WITH web_only AS (
SELECT user_id FROM data_web EXCEPT SELECT user_id FROM data_mobile
),
mobile_only AS (
SELECT user_id FROM data_mobile EXCEPT SELECT user_id FROM data_web
),
both_tables AS (
SELECT user_id FROM data_web INTERSECT SELECT user_id FROM data_mobile
),
all_users AS (
SELECT user_id FROM data_web UNION SELECT user_id FROM data_mobile
)
SELECT
(SELECT COUNT(*) FROM mobile_only) / (SELECT COUNT(*) FROM all_users)::double precision AS only_mobile,
(SELECT COUNT(*) FROM web_only) / (SELECT COUNT(*) FROM all_users)::double precision AS only_web,
(SELECT COUNT(*) FROM both_tables) / (SELECT COUNT(*) FROM all_users)::double precision AS both;
Result
only_mobile | only_web | both |
---|---|---|
0.2 | 0.4 | 0.4 |
Notes
- As requested, this query is one statement.
- The cast to
double precision
is necessary to obtain a floating-point result. - This should be moderately performant for decent sized sets, but
EXPLAIN ANALYZE
should be watched with large sets to make sure the hash aggregations are not overflowingwork_mem
. Since theall_users
set depends on deduplication, performance can suffer if PostgreSQL decides to use a sort & unique operation instead. - There’s probably a way to do this using table joins, flag fields, and clever uses of
SUM()
that would perform better for extremely large input sets.
A Word About Views
Views are a form of saved query. You could create a view out of this query and run it with a simple SELECT * FROM view_name
anytime you want. But it doesn’t have to be a view to perform well.