Question :
I have data like the following:
id | src_ip | dst_ip | port |
----|----------------|--------------|------|
256 | 192.168.61.200 | 10.75.64.222 | 80 |
257 | 192.168.61.200 | 10.75.64.222 | 81 |
258 | 10.65.72.207 | 10.75.64.223 | 80 |
259 | 10.75.254.27 | 10.75.64.223 | 82 |
260 | 10.75.254.27 | 10.75.64.224 | 80 |
261 | 10.75.254.27 | 10.75.64.230 | 80 |
262 | 10.75.254.27 | 10.75.64.230 | 81 |
263 | 10.75.254.27 | 10.75.64.230 | 82 |
264 | 10.75.254.27 | 10.75.64.230 | 83 |
265 | 10.75.254.27 | 10.75.64.230 | 84 |
I built the schema and loaded the same data here: http://sqlfiddle.com/#!9/1185c
I’m planning to group the rows in a way to reproduce some rules for my firewall. Given the firewall supports ranges for ports or IPs I want to query this table and group as many rows as I can to produce valid rules.
There are a few different ways to group rows, but I want to start grouping by consecutive dst_ip
on the same port.
Given the table presented above, I would like to get something like this:
dst_ip | max_consecutive | port |
--------------|-----------------|------|
10.75.64.222 | 10.75.64.224 | 80 |
10.75.64.222 | 10.75.64.222 | 81 |
10.75.64.223 | 10.75.64.223 | 82 |
10.75.64.230 | 10.75.64.230 | 80 |
10.75.64.230 | 10.75.64.230 | 81 |
10.75.64.230 | 10.75.64.230 | 82 |
10.75.64.230 | 10.75.64.230 | 83 |
10.75.64.230 | 10.75.64.230 | 84 |
Note: max_consecutive
could be the number of consecutive rows, too, not necessarily the upper IP.
Actually, if I can go deeper, I would like to group by port
, too. That means the last 5 rows of the result will be summarized in just one like
10.75.64.230 | 10.75.64.230 | 80 | 84
I’m currently using MySQL, but I can move to PostgreSQL if it is necessary.
Answer :
In Postgres (tested with v9.3) you can use the dedicated inet
data type, to store IPv4 addresses with only 7 bytes (or IPv6 with 19 bytes) and with automatic integrity checks and dedicated functions and type casts etc.
Schema
The translated schema could look like this:
CREATE TABLE log (
id serial PRIMARY KEY
, dst_port int
, src_ip inet
, dst_ip inet
);
CREATE INDEX ON log (dst_port);
CREATE INDEX ON log (src_ip);
I moved to dst_port int
to the 2nd position to optimize alignment / padding:
Now we can use standard window functions (not possible in MySQL).
Step 1: Fold groups of consecutive dst_ip
for same (dst_port
)
One special difficulty: The aggregate function min()
/ max()
are not yet implemented for inet
in Postgres 9.4. Both are in the upcoming Postgres 9.5!
So I substituted with DISTINCT ON
in the first step:
SELECT DISTINCT ON (dst_port, ip_grp)
dst_ip, count(*) OVER (PARTITION BY dst_port, ip_grp) AS ip_ct, dst_port
FROM (
SELECT dst_ip, dst_port, dst_ip - row_number() OVER (PARTITION BY dst_port
ORDER BY dst_ip) AS ip_grp
FROM log
ORDER BY dst_port, dst_ip
) sub
ORDER BY dst_port, ip_grp, dst_ip;
Result as desired – with a count of rows (could be upper IP as well).
You can subtract/add integer
from/to the inet
type. By subtracting the row_number()
all consecutive rows get the same grp
– the value of grp
is irrelevant, just the fast that it’s the same per partition (dst_port
).
Then we can GROUP BY ...
– or in this special case DISTINCT ON dst_port, ip_grp
. I use another window function to get the count ip_ct
in the same step: count(*) OVER (PARTITION BY dst_port, ip_grp) AS ip_ct
.
Note that consecutive IPs can cross byte boundaries (see my comment to question).
Detailed explanation for this technique:
Step 2: Fold groups of consecutive dst_port
for same (dst_ip, ip_ct)
SELECT dst_ip, ip_ct, min(dst_port) AS dst_port, count(*) AS port_ct
FROM (
SELECT *, dst_port - row_number() OVER (PARTITION BY dst_ip, ip_ct
ORDER BY dst_port) AS port_grp
FROM (
SELECT DISTINCT ON (dst_port, ip_grp)
dst_ip, count(*) OVER (PARTITION BY dst_port, ip_grp) AS ip_ct, dst_port
FROM (
SELECT dst_ip, dst_port, dst_ip - row_number() OVER (PARTITION BY dst_port
ORDER BY dst_ip) AS ip_grp
FROM log
ORDER BY dst_port, dst_ip
) sub1
ORDER BY dst_port, ip_grp, dst_ip
) sub2
) sub3
GROUP BY 1, 2, port_grp
ORDER BY 1, 3, 2;
Basically, repeat the same logic like in the first step, applied to the result of the first step.
But now you have to group on ip_ct
additionally. And this time, you can use the simpler min(dst_port)
, since the port number is a plain integer
.
SQL Fiddle demonstrating all.