Question:
I need help optimizing this function: it currently takes more than 2 minutes, which is far from optimal. The function returns the name of each column, its corresponding table, and the list of unique values the column contains.
Function Definition:
CREATE OR REPLACE FUNCTION get_pubcol(
included_tables name[] default '{}',
included_schema name[] default '{}'
)
RETURNS table(schema_name text, table_name text, column_name text, column_value text)
AS $$
begin
FOR schema_name, table_name, column_name IN
SELECT c.table_schema, c.table_name, c.column_name
FROM information_schema.columns c
JOIN information_schema.tables t ON
(t.table_name = c.table_name AND t.table_schema=c.table_schema)
WHERE (c.table_name=ANY(included_tables) OR included_tables='{}')
AND (c.table_schema=ANY(included_schema) OR included_schema='{}')
AND t.table_type='BASE TABLE'
LOOP
EXECUTE format('SELECT ARRAY_AGG(DISTINCT %I) AS column_value FROM %I.%I',
column_name,
schema_name,
table_name
) INTO column_value;
IF column_value is not null THEN
RETURN NEXT;
END IF;
END LOOP;
END;
$$ language plpgsql;
Function call:
SELECT * FROM get_pubcol('{}', '{public}');
Explain Analyze Result:
"Function Scan on get_pubcol (cost=0.25..10.25 rows=1000 width=128) (actual time=121235.178..121235.199 rows=73 loops=1)"
"Planning time: 0.049 ms"
"Execution time: 121237.436 ms"
Query Result:
schema_name table_name column_name column_value
"public" "MARA" "Lab/Office" "{" ",BL1,GL1,GP1,KL1,KP1,NaN,SL1,SP1,WL1,WL2,WP1}"
Answer:
You are running one query for each column of each table. One way to optimize this is to run only a single query per table that retrieves the distinct values for all columns in one go. That result can be put into a jsonb value, which in turn can be expanded into individual rows using jsonb_each_text:
CREATE OR REPLACE FUNCTION get_pubcol(
included_tables name[] default '{}',
included_schema name[] default '{}'
)
RETURNS table(schema_name text, table_name text, column_name text, column_value text)
AS
$$
declare
l_value jsonb;
l_sql text;
begin
FOR schema_name, table_name, l_sql IN
SELECT c.table_schema, c.table_name,
concat('select to_jsonb(t) from ( select ',
string_agg(format('array_agg(distinct %I) as %I', c.column_name, c.column_name), ', '),
format(' from %I.%I) as t', c.table_schema, c.table_name)) as sql
FROM information_schema.columns c
JOIN information_schema.tables t
ON t.table_name = c.table_name AND t.table_schema=c.table_schema
WHERE t.table_type = 'BASE TABLE'
AND (c.table_name = ANY(included_tables) OR included_tables='{}')
AND (c.table_schema = ANY(included_schema) OR included_schema='{}')
group by c.table_schema, c.table_name
LOOP
EXECUTE l_sql into l_value;
IF l_value is not null THEN
return query
select schema_name, table_name, x.*
from jsonb_each_text(l_value) as x(column_name, column_value);
END IF;
END LOOP;
END;
$$ language plpgsql;
The outer query assembles a single query for each table. E.g. if you have a table person with the columns id, firstname and lastname, the expression in the loop query would generate something like:
select to_jsonb(t)
from (
select array_agg(distinct lastname) as lastname, array_agg(distinct id) as id, array_agg(distinct firstname) as firstname
from public.person
) as t
That query returns a single jsonb value that contains the information for all columns. This is stored in the local variable l_value, something like:
{
"id": [42,43,44,45],
"lastname": ["Beeblebrox","Dent","McMillan","Prefect"],
"firstname": ["Arthur","Ford","Tricia","Zaphod"]
}
This is then turned back into rows using jsonb_each_text.
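As a standalone illustration (using a made-up jsonb literal, not the function's actual data), jsonb_each_text expands each key/value pair of a jsonb object into one text row, which is exactly what the RETURN QUERY inside the loop relies on:

```sql
-- Each key becomes column_name, each value (rendered as text) column_value.
SELECT x.column_name, x.column_value
FROM jsonb_each_text(
       '{"id": [42, 43], "lastname": ["Dent", "Prefect"]}'::jsonb
     ) AS x(column_name, column_value);
```

This yields one row per key ("id" and "lastname"), with the aggregated arrays rendered in their JSON text form.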
Note that the values will be displayed differently, though, because they are now rendered as JSON arrays rather than Postgres arrays. Your example
{BL1,GL1,GP1,KL1,KP1,NaN,SL1,SP1,WL1,WL2,WP1}
will be displayed as
["BL1","GL1","GP1","KL1","KP1","NaN","SL1","SP1","WL1","WL2","WP1"]
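If you prefer the original Postgres array display, you could convert each value back before returning it. A sketch (shown here on a literal, but the same expression could be applied to column_value inside the function):

```sql
-- Unnest the JSON array into text elements, then re-aggregate into a
-- Postgres text[] so it is displayed as {BL1,GL1,GP1} again.
SELECT array_agg(e)::text AS column_value
FROM jsonb_array_elements_text('["BL1","GL1","GP1"]'::jsonb) AS e;
-- → {BL1,GL1,GP1}
```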