Filter out rows which the text on that current row is contained in the text of the following row

Posted on

Question :

I wonder if it’s possible, in a SELECT, to filter out rows which the text on that current row is contained in the text of the following row.

For example, image we log “searches” performed by a user, “as they type”:

1  something I searched
2  another search
3  this
4  this is a
5  this is a third
6  this is a third search
7  fourth
8  fourth search
9  here's a ry
10 here's a typo

Rows 3-6 are effectively the same – just the user was a bit “slow” typing it, so we saved multiple times.
Rows 7-8 the same too.
Rows 9-10 are the same, but there was a typo.

I am looking to do a SELECT that would return the following rows only:

1  something I searched
2  another search
6  this is a third search
8  fourth search
9  here's a ry
10 here's a typo

Is something like this possible?

Note – there’s no guarantee at all that the IDs will be in a sequence for a specific user. So, I can’t simply JOIN with the table itself, by the id = id+1. All I can do is ORDER BY id to get them in order.

Answer :

You’ll want to use a window function, specifically the LEAD() function would make great use here like so:

-- Gets the following row's search text and compare it to the current row's search text to see if the following row starts with the current row's
WITH _SearchesMarkedAsRedundant
(
    SELECT 
        id, 
        userId,
        searchText,             
        IF(LEAD(searchText) OVER (PARTITION BY userId ORDER BY id) LIKE CONCAT(searchText, '%'), 1, 0) ​IsRedundantSearch
    FROM Searches
)

-- Returns only the rows whose following row did not start with this row's search text
SELECT id, userId, searchText
FROM _SearchesMarkedAsRedundant
WHERE IsRedundantSearch = 0

And if you wanted to globally remove redundant searches, despite which user inputted the search, you can remove the PARTITION BY clause of the window function like so IF(LEAD(searchText) OVER (userId ORDER BY id) LIKE CONCAT(searchText, '%'), 1, 0) ​IsRedundantSearch.

The above will accomplish the output from your given example. Note it specifically only looks at the next 1 row (sorted by the id column) and checks if the that row starts with the same search text as the current row. You can make it do a full contains check instead by adding another wildcard % symbol to the front of the searchText in the CONCAT() operator like so CONCAT('%', searchText, '%').

The LEAD() window function is the key to this solution here, as it can look-ahead at the column inputted by however many rows you specify. The default number of rows it looks-ahead is 1 but if you wanted to look at the 3rd row ahead instead of the next row, then LEAD(searchText, 3) would look-ahead by 3 rows.

Leave a Reply

Your email address will not be published.