Calculate Percentage for Values Exist in Only One or Both Tables

Posted on

Question :

There are two tables:

Table 1: data_mobile(user_id, page_id)

user_id page_id
1 2
1 3
2 2
3 4

Table 2: data_web(user_id, webpage_id)

user_id webpage_id
1 5
4 3
5 2
3 4

Problem: Calculate the percentage of users who only use mobile, only website, and both. All figures add up should be 1. It’s better to have all figures showed in the same table.

For example, the expected table can be:

only_mobile only_web both
0.2 0.4 0.4

OR

Type Percentage
only_mobile 0.2
only_web 0.4
both 0.4

Wondering if there is any efficient ways without creating views and can be in the same sql statement if possible. Please let me know if anything is unclear. Appreciated.

Answer :

There is no need for any complicated set operations.

You just need to full-join the two tables together, with the added complication of doing that only over distinct user_id.

Then, a straight-forward conditional aggregation looks like it should work

SELECT
  COUNT(*) FILTER (WHERE w.user_id IS NULL) * 1.0 / COUNT(*) AS only_mobile,
  COUNT(*) FILTER (WHERE m.user_id IS NULL) * 1.0 / COUNT(*) AS only_web,
  COUNT(*) FILTER (WHERE m.user_id IS NOT NULL AND w.user_id IS NOT NULL) * 1.0 / COUNT(*) AS both  
FROM (SELECT DISTINCT user_id FROM data_mobile) m
FULL JOIN (SELECT DISTINCT user_id FROM data_web) w ON w.user_id = m.user_id

db<>fiddle

The FILTER clause is specific to Postgres, you can simulate it in other DBMSs using COUNT(CASE WHEN...

This is an excellent demonstration of the set operations available in SQL.

References

Demonstration fiddle.

Set Operators

SQL provides set operators that manipulates two sets (rows) of data in useful ways:

  • UNION combines two sets. The columns must be the same type. There are two variants, just UNION removes duplicates while UNION ALL does not.
  • INTERSECT takes rows that appear in both sets.
  • EXCEPT takes rows that appear in the first set and not the second.

Sets

The sets we need to build are:

  • A set that includes user_ids that only appear in data_mobile.
  • A set that includes user_ids that only appear in data_web.
  • A set that includes user_ids that appear in both tables.
  • A set that includes all user_ids.

The first three are the data columns that were requested, the fourth is used to calculate the percentages. The sets we need to build map nicely to the available operations:

  • web_only -> data_web EXCEPT data_mobile
  • mobile_only -> data_mobile EXCEPT data_web
  • both -> data_web INTERSECT data_mobile
  • all -> data_web UNION data_mobile

Query

To demonstrate the various set operators in use, I used CTEs for each set so it is clear which operation is used to build which set.


WITH web_only AS (
    SELECT user_id FROM data_web EXCEPT SELECT user_id FROM data_mobile
),
mobile_only AS (
    SELECT user_id FROM data_mobile EXCEPT SELECT user_id FROM data_web
),
both_tables AS (
    SELECT user_id FROM data_web INTERSECT SELECT user_id FROM data_mobile
),
all_users AS (
    SELECT user_id FROM data_web UNION SELECT user_id FROM data_mobile
)

SELECT
    (SELECT COUNT(*) FROM mobile_only) / (SELECT COUNT(*) FROM all_users)::double precision AS only_mobile,
    (SELECT COUNT(*) FROM web_only) / (SELECT COUNT(*) FROM all_users)::double precision AS only_web,
    (SELECT COUNT(*) FROM both_tables) / (SELECT COUNT(*) FROM all_users)::double precision AS both;

Result

only_mobile only_web both
0.2 0.4 0.4

Notes

  • As requested, this query is one statement.
  • The cast to double precision is necessary to obtain a floating-point result.
  • This should be moderately performant for decent sized sets, but EXPLAIN ANALYZE should be watched with large sets to make sure the hash aggregations are not overflowing work_mem. Since the all_users set depends on deduplication, performance can suffer if PostgreSQL decides to use a sort & unique operation instead.
  • There’s probably a way to do this using table joins, flag fields, and clever uses of SUM() that would perform better for extremely large input sets.

A Word About Views

Views are a form of saved query. You could create a view out of this query and run it with a simple SELECT * FROM view_name anytime you want. But it doesn’t have to be a view to perform well.

Leave a Reply

Your email address will not be published. Required fields are marked *