# Calculate Percentage for Values Exist in Only One or Both Tables

Posted on

### Question :

There are two tables:

Table 1: data_mobile(user_id, page_id)

user_id page_id
1 2
1 3
2 2
3 4

Table 2: data_web(user_id, webpage_id)

user_id webpage_id
1 5
4 3
5 2
3 4

Problem: Calculate the percentage of users who only use mobile, only website, and both. All figures add up should be 1. It’s better to have all figures showed in the same table.

For example, the expected table can be:

only_mobile only_web both
0.2 0.4 0.4

OR

Type Percentage
only_mobile 0.2
only_web 0.4
both 0.4

Wondering if there is any efficient ways without creating views and can be in the same sql statement if possible. Please let me know if anything is unclear. Appreciated.

There is no need for any complicated set operations.

You just need to full-join the two tables together, with the added complication of doing that only over distinct `user_id`.

Then, a straight-forward conditional aggregation looks like it should work

``````SELECT
COUNT(*) FILTER (WHERE w.user_id IS NULL) * 1.0 / COUNT(*) AS only_mobile,
COUNT(*) FILTER (WHERE m.user_id IS NULL) * 1.0 / COUNT(*) AS only_web,
COUNT(*) FILTER (WHERE m.user_id IS NOT NULL AND w.user_id IS NOT NULL) * 1.0 / COUNT(*) AS both
FROM (SELECT DISTINCT user_id FROM data_mobile) m
FULL JOIN (SELECT DISTINCT user_id FROM data_web) w ON w.user_id = m.user_id
``````

db<>fiddle

The `FILTER` clause is specific to Postgres, you can simulate it in other DBMSs using `COUNT(CASE WHEN...`

This is an excellent demonstration of the set operations available in SQL.

## Set Operators

SQL provides set operators that manipulates two sets (rows) of data in useful ways:

• `UNION` combines two sets. The columns must be the same type. There are two variants, just `UNION` removes duplicates while `UNION ALL` does not.
• `INTERSECT` takes rows that appear in both sets.
• `EXCEPT` takes rows that appear in the first set and not the second.

## Sets

The sets we need to build are:

• A set that includes `user_id`s that only appear in `data_mobile`.
• A set that includes `user_id`s that only appear in `data_web`.
• A set that includes `user_id`s that appear in both tables.
• A set that includes all `user_id`s.

The first three are the data columns that were requested, the fourth is used to calculate the percentages. The sets we need to build map nicely to the available operations:

• `web_only` -> `data_web EXCEPT data_mobile`
• `mobile_only` -> `data_mobile EXCEPT data_web`
• `both` -> `data_web INTERSECT data_mobile`
• `all` -> `data_web UNION data_mobile`

## Query

To demonstrate the various set operators in use, I used CTEs for each set so it is clear which operation is used to build which set.

``````
WITH web_only AS (
SELECT user_id FROM data_web EXCEPT SELECT user_id FROM data_mobile
),
mobile_only AS (
SELECT user_id FROM data_mobile EXCEPT SELECT user_id FROM data_web
),
both_tables AS (
SELECT user_id FROM data_web INTERSECT SELECT user_id FROM data_mobile
),
all_users AS (
SELECT user_id FROM data_web UNION SELECT user_id FROM data_mobile
)

SELECT
(SELECT COUNT(*) FROM mobile_only) / (SELECT COUNT(*) FROM all_users)::double precision AS only_mobile,
(SELECT COUNT(*) FROM web_only) / (SELECT COUNT(*) FROM all_users)::double precision AS only_web,
(SELECT COUNT(*) FROM both_tables) / (SELECT COUNT(*) FROM all_users)::double precision AS both;
``````

## Result

only_mobile only_web both
0.2 0.4 0.4

## Notes

• As requested, this query is one statement.
• The cast to `double precision` is necessary to obtain a floating-point result.
• This should be moderately performant for decent sized sets, but `EXPLAIN ANALYZE` should be watched with large sets to make sure the hash aggregations are not overflowing `work_mem`. Since the `all_users` set depends on deduplication, performance can suffer if PostgreSQL decides to use a sort & unique operation instead.
• There’s probably a way to do this using table joins, flag fields, and clever uses of `SUM()` that would perform better for extremely large input sets.

Views are a form of saved query. You could create a view out of this query and run it with a simple `SELECT * FROM view_name` anytime you want. But it doesn’t have to be a view to perform well.