Count missing values in each column of the table in PostgreSQL

Posted on

Question :

Question: Which column of fortune500 (a table) has the most missing values? To find out, I have checked each column individually mentioned below, which is very tedious.

SELECT 
    SUM(CASE WHEN ticker IS NULL THEN 1 ELSE 0 END) AS ticker_null_num, 
    SUM(CASE WHEN profits_change IS NULL THEN 1 ELSE 0 END) AS profits_change_null_num,
    SUM(CASE WHEN industry IS NULL THEN 1 ELSE 0 END) AS industry_null_num
FROM fortune500;

And also performed the following query for each column individually:

SELECT count(*) - count(ticker) AS missing
  FROM fortune500;
 etc....

My Question:
Is there any better/dynamic way of doing it because this approach is very cumbersome and it will take a lot of time as I have like 50 – 60 columns in a table then what should I do in that case instead of this manual approach. Can somebody help me find the missing values of each column sorted in a descending order with a good method? Like for example:

COLUMN_NAME    MISSING_VALUES_COUNT
col1               60
col2               50
col3               45
col4               40
etc.....  

Answer :

You can convert the rows into JSON to dynamically generate one row for each column:

select colname, 
       count(cols.value) as non_null_values,
       (select count(*) from the_table) - count(cols.value) as missing
from the_table t
  cross join jsonb_each(jsonb_strip_nulls(to_jsonb(t))) as cols(colname, value)
group by colname;

But this isn’t going to be fast on large tables.

A query of this basic form only uses a single sequential scan and is as efficient as it gets:

SELECT x.*
FROM  (
   SELECT count(*) AS ct
        , count(ticker) AS ticker
        , count(profits_change) AS profits_change
        , count(industry) AS industry
     -- , more?
   FROM   fortune500
   ) t
CROSS  JOIN LATERAL (
   VALUES
      ('ticker', ct - ticker)
    , ('profits_change', ct - profits_change)
    , ('industry', ct - industry
 -- , more?
   ) x(column_name, missing_values)
   ORDER  BY missing_values DESC, column_name DESC;

This function generates and executes the query for all columns of a given table dynamically:

CREATE OR REPLACE FUNCTION f_count_nulls(_tbl regclass)
  RETURNS TABLE (column_name text, missing_values bigint)
  LANGUAGE plpgsql STABLE PARALLEL SAFE AS
$func$
BEGIN
   RETURN QUERY EXECUTE (
   SELECT format(
   $$
   SELECT x.*
   FROM  (SELECT count(*) AS ct, %s FROM %s) t
   CROSS  JOIN LATERAL (VALUES %s) x(col, nulls)
   ORDER  BY nulls DESC, col DESC
   $$, string_agg(format('count(%1$I) AS %1$I', attname), ', ')
     , $1
     , string_agg(format('(%1$L, ct - %1$I)', attname), ', ')
      )
   FROM   pg_catalog.pg_attribute
   WHERE  attrelid = $1
   AND    attnum > 0
   AND    NOT attisdropped
   -- more filters?
   );
END
$func$;

Call:

SELECT * FROM f_count_nulls('public.fortune500');  -- optionally schema-qualified

Produces the requested result.
All identifiers are handled safely (quoted when required, no SQL injection).

db<>fiddle here

Related:

Leave a Reply

Your email address will not be published. Required fields are marked *