Group by maximum consecutive row

Posted on

Question :

I have data like the following:

id  | src_ip         | dst_ip       | port |
----|----------------|--------------|------|
256 | 192.168.61.200 | 10.75.64.222 | 80   |
257 | 192.168.61.200 | 10.75.64.222 | 81   |
258 | 10.65.72.207   | 10.75.64.223 | 80   |
259 | 10.75.254.27   | 10.75.64.223 | 82   |
260 | 10.75.254.27   | 10.75.64.224 | 80   |
261 | 10.75.254.27   | 10.75.64.230 | 80   |
262 | 10.75.254.27   | 10.75.64.230 | 81   |
263 | 10.75.254.27   | 10.75.64.230 | 82   |
264 | 10.75.254.27   | 10.75.64.230 | 83   |
265 | 10.75.254.27   | 10.75.64.230 | 84   |

I built the schema and loaded the same data here: http://sqlfiddle.com/#!9/1185c

I’m planning to group the rows in a way to reproduce some rules for my firewall. Given the firewall supports ranges for ports or IPs I want to query this table and group as many rows as I can to produce valid rules.

There are a few different ways to group rows, but I want to start grouping by consecutive dst_ip on the same port.

Given the table presented above, I would like to get something like this:

 dst_ip       | max_consecutive | port |
--------------|-----------------|------|
 10.75.64.222 | 10.75.64.224    | 80   |
 10.75.64.222 | 10.75.64.222    | 81   |
 10.75.64.223 | 10.75.64.223    | 82   |
 10.75.64.230 | 10.75.64.230    | 80   |
 10.75.64.230 | 10.75.64.230    | 81   |
 10.75.64.230 | 10.75.64.230    | 82   |
 10.75.64.230 | 10.75.64.230    | 83   |
 10.75.64.230 | 10.75.64.230    | 84   |

Note: max_consecutive could be the number of consecutive rows, too, not necessarily the upper IP.

Actually, if I can go deeper, I would like to group by port, too. That means the last 5 rows of the result will be summarized in just one like

 10.75.64.230 | 10.75.64.230    | 80   | 84

I’m currently using MySQL, but I can move to PostgreSQL if it is necessary.

Answer :

In Postgres (tested with v9.3) you can use the dedicated inet data type, to store IPv4 addresses with only 7 bytes (or IPv6 with 19 bytes) and with automatic integrity checks and dedicated functions and type casts etc.

Schema

The translated schema could look like this:

CREATE TABLE log (
  id serial PRIMARY KEY
, dst_port int
, src_ip inet
, dst_ip inet
);
CREATE INDEX ON log (dst_port);
CREATE INDEX ON log (src_ip);

I moved to dst_port int to the 2nd position to optimize alignment / padding:

Now we can use standard window functions (not possible in MySQL).

Step 1: Fold groups of consecutive dst_ip for same (dst_port)

One special difficulty: The aggregate function min() / max() are not yet implemented for inet in Postgres 9.4. Both are in the upcoming Postgres 9.5!

So I substituted with DISTINCT ON in the first step:

SELECT DISTINCT ON (dst_port, ip_grp)
       dst_ip, count(*) OVER (PARTITION BY dst_port, ip_grp) AS ip_ct, dst_port
FROM  (
   SELECT dst_ip, dst_port, dst_ip - row_number() OVER (PARTITION BY dst_port
                                                        ORDER BY dst_ip) AS ip_grp
   FROM   log
   ORDER  BY dst_port, dst_ip
   ) sub
ORDER  BY dst_port, ip_grp, dst_ip;

Result as desired – with a count of rows (could be upper IP as well).

You can subtract/add integer from/to the inet type. By subtracting the row_number() all consecutive rows get the same grp – the value of grp is irrelevant, just the fast that it’s the same per partition (dst_port).

Then we can GROUP BY ... – or in this special case DISTINCT ON dst_port, ip_grp. I use another window function to get the count ip_ct in the same step: count(*) OVER (PARTITION BY dst_port, ip_grp) AS ip_ct.

Note that consecutive IPs can cross byte boundaries (see my comment to question).

Detailed explanation for this technique:

Step 2: Fold groups of consecutive dst_port for same (dst_ip, ip_ct)

SELECT dst_ip, ip_ct, min(dst_port) AS dst_port, count(*) AS port_ct
FROM  (
   SELECT *, dst_port - row_number() OVER (PARTITION BY dst_ip, ip_ct
                                           ORDER BY dst_port) AS port_grp
   FROM  (
      SELECT DISTINCT ON (dst_port, ip_grp)
             dst_ip, count(*) OVER (PARTITION BY dst_port, ip_grp) AS ip_ct, dst_port
      FROM  (
         SELECT dst_ip, dst_port, dst_ip - row_number() OVER (PARTITION BY dst_port
                                                              ORDER BY dst_ip) AS ip_grp
         FROM   log
         ORDER  BY dst_port, dst_ip
         ) sub1
      ORDER  BY dst_port, ip_grp, dst_ip
      ) sub2
   ) sub3
GROUP  BY 1, 2, port_grp
ORDER  BY 1, 3, 2;

Basically, repeat the same logic like in the first step, applied to the result of the first step.
But now you have to group on ip_ct additionally. And this time, you can use the simpler min(dst_port), since the port number is a plain integer.

SQL Fiddle demonstrating all.

Leave a Reply

Your email address will not be published. Required fields are marked *