This is quite a long question, please bear with me.
So I’d like to first explain I have a database of firewall logs created using the following command:
CREATE TABLE firewall_logs_mapped ( log_time text, log_time_mapped double precision, syslog_priority text, syslog_priority_mapped double precision, operation text, operation_mapped double precision, message_code text, message_code_mapped double precision, protocol text, protocol_mapped double precision, source_ip text, source_ip_mapped double precision, destination_ip text, destination_ip_mapped double precision, source_port text, source_port_mapped double precision, destination_port text, destination_port_mapped double precision, destination_service text, destination_service_mapped double precision, direction text, direction_mapped double precision, connections_built text, connections_built_mapped double precision, connections_torn_down text, connections_torn_down_mapped double precision, hourofday text, hourofday_mapped double precision, meridiem text, meridiem_mapped double precision )
So basically for every value in the database there is also a mapped value which is a float representing the y co-ordinate for the value such that it can be plotted on a graph which I have set-up.
I also have an index on each of the un-mapped columns in the database.
Now I have a an interface whereby the user can search the database by specifying the column name and value, and the software will convert this to a query and query the database. I.e. they can input something like:
and the software creates and runs the following query:
SELECT DISTINCT log_time_mapped, syslog_priority_mapped, operation_mapped, message_code_mapped, protocol_mapped, source_ip_mapped, destination_ip_mapped, source_port_mapped, destination_port_mapped, destination_service_mapped, direction_mapped, connections_built_mapped, connections_torn_down_mapped, hourofday_mapped, meridiem_mapped FROM firewall_logs_mapped WHERE operation = 'Built'
Now this particular query doesn’t actually use the index (presumably because it returns ~48% of the database rows) and returns 11,426,373 rows in 126,775ms. Here is the explain analyze for the query:
"Seq Scan on public.firewall_logs_mapped (cost=0.00..1234282.95 rows=11295161 width=120) (actual time=0.357..16139.005 rows=11426373 loops=1)" " Output: log_time_mapped, syslog_priority_mapped, operation_mapped, message_code_mapped, protocol_mapped, source_ip_mapped, destination_ip_mapped, source_port_mapped, destination_port_mapped, destination_service_mapped, direction_mapped, connections_built (...)" " Filter: (firewall_logs_mapped.operation = 'Built'::text)" " Rows Removed by Filter: 12049756" "Total runtime: 16751.255 ms"
Now if i understand this right it seems the time is spent filtering the data by the where clause. Can this time be reduced?
So as I am plotting a parallel co-ordinate graph I only require the unique combinations of the column values returned through the where clause, as otherwise there will just be a lot of lines overdrawn on each other.
How would I go about selecting only the unique combinations of adjacent columns? Is this too complicated a task to perform through a database query? Would it speed up the query?
SELECT DISTINCT will select the unique rows (combinations of all values in a row), I’m talking about the unique combinations of any two adjacent columns, as in parallel co-ordinates a line is drawn between adjacent columns, for each column in a row, i.e. from log_time to syslog_priority, the next line for this row will be from syslog_priority to operation etc. Check out this link: http://en.wikipedia.org/wiki/File:ParCorFisherIris.png, in my case each axes would be one column in the database.
Currently I do this in Python by using the results to create an array of co-ordinates for the lines to be drawn, and then I loop through the array adding the line co-ordinates to a set if not seen before in the loop.
Indexes have been created as follows:
CREATE INDEX unique_log_time_index ON unique_firewall_logs_mapped (log_time); CREATE INDEX unique_syslog_priority_index ON unique_firewall_logs_mapped (syslog_priority); CREATE INDEX unique_operation_index ON unique_firewall_logs_mapped (operation); CREATE INDEX unique_message_code_index ON unique_firewall_logs_mapped (message_code); CREATE INDEX unique_protocol_index ON unique_firewall_logs_mapped (protocol); CREATE INDEX unique_source_ip_index ON unique_firewall_logs_mapped (source_ip); CREATE INDEX unique_destination_ip_index ON unique_firewall_logs_mapped (destination_ip); CREATE INDEX unique_source_port_index ON unique_firewall_logs_mapped (source_port); CREATE INDEX unique_destination_port_index ON unique_firewall_logs_mapped (destination_port); CREATE INDEX unique_destination_service_index ON unique_firewall_logs_mapped (destination_service); CREATE INDEX unique_direction_index ON unique_firewall_logs_mapped (direction); CREATE INDEX unique_connections_built_index ON unique_firewall_logs_mapped (connections_built); CREATE INDEX unique_connections_torn_down_index ON unique_firewall_logs_mapped (connections_torn_down); CREATE INDEX unique_hourofday_index ON unique_firewall_logs_mapped (hourofday); CREATE INDEX unique_meridiem_index ON unique_firewall_logs_mapped (meridiem);
My PostgreSQL version is 9.3, random_page_cost is 4.0, seq_page_cost is 1.0.
Also I’d like to say I tried forcing the use of the index and the query took 2 seconds or so longer than without the index.
The execution plan shown does not seem to match the big
SELECT DISTINCT query because the
Unique steps are missing. Anyway you are correct than when retrieving ~50% of a table, index don’t help. The best strategy is a big sequential scan of the main table and only fast hardware helps with that.
For the 2nd part of the question:
How would I go about selecting only the unique combinations of
adjacent columns? Is this too complicated a task to perform through a
database query? Would it speed up the query?
To remove duplicate combinations of adjacent columns, the structure of the resultset should be changed so that each output row has only one couple of adjacent columns along with their corresponding dimensions in the parallel coordinates graph. Well, except that the dimension for the 2nd column is not necessary since it’s always the dimension for the other column plus one.
In one single query, this could be written like this:
WITH logs as ( SELECT log_time_mapped, syslog_priority_mapped, operation_mapped, message_code_mapped, protocol_mapped, source_ip_mapped, destination_ip_mapped, source_port_mapped, destination_port_mapped, destination_service_mapped, direction_mapped, connections_built_mapped, connections_torn_down_mapped, hourofday_mapped, meridiem_mapped FROM firewall_logs_mapped WHERE operation = 'Built') SELECT DISTINCT 1, log_time_mapped, syslog_priority_mapped FROM logs UNION ALL SELECT DISTINCT 2, syslog_priority_mapped, operation_mapped FROM logs UNION ALL SELECT DISTINCT 3, operation_mapped, message_code_mapped FROM logs UNION ALL ...etc... SELECT DISTINCT 14, hourofday_mapped, meridiem_mapped FROM logs ;
SELECT DISTINCT subquery extracts the lines to draw between dimensions 1 and 2, the next subquery between dimensions 2 and 3, and so on.
DISTINCT eliminates duplicates, so the client side doesn’t have to do it. The
UNION ALL concatenates the results without any further processing.
However it’s a heavy query and it’s dubious that it would be any faster than what you’re already doing.
WITH subquery is likely to gets slowly materialized on disk, so it might be interesting to compare the execution time with this other form repeating the same condition:
SELECT DISTINCT 1, log_time_mapped, syslog_priority_mapped FROM firewall_logs_mapped WHERE operation = 'Built' UNION ALL SELECT DISTINCT 2, syslog_priority_mapped, operation_mapped FROM firewall_logs_mapped WHERE operation = 'Built' UNION ALL SELECT DISTINCT 3, operation_mapped, message_code_mapped FROM firewall_logs_mapped WHERE operation = 'Built' ...etc... ;