I wonder if it’s possible, in a
SELECT, to filter out rows which the text on that current row is contained in the text of the following row.
For example, image we log “searches” performed by a user, “as they type”:
1 something I searched 2 another search 3 this 4 this is a 5 this is a third 6 this is a third search 7 fourth 8 fourth search 9 here's a ry 10 here's a typo
Rows 3-6 are effectively the same – just the user was a bit “slow” typing it, so we saved multiple times.
Rows 7-8 the same too.
Rows 9-10 are the same, but there was a typo.
I am looking to do a
SELECT that would return the following rows only:
1 something I searched 2 another search 6 this is a third search 8 fourth search 9 here's a ry 10 here's a typo
Is something like this possible?
Note – there’s no guarantee at all that the IDs will be in a sequence for a specific user. So, I can’t simply
JOIN with the table itself, by the
id = id+1. All I can do is
ORDER BY id to get them in order.
You’ll want to use a window function, specifically the
LEAD() function would make great use here like so:
-- Gets the following row's search text and compare it to the current row's search text to see if the following row starts with the current row's WITH _SearchesMarkedAsRedundant ( SELECT id, userId, searchText, IF(LEAD(searchText) OVER (PARTITION BY userId ORDER BY id) LIKE CONCAT(searchText, '%'), 1, 0) IsRedundantSearch FROM Searches ) -- Returns only the rows whose following row did not start with this row's search text SELECT id, userId, searchText FROM _SearchesMarkedAsRedundant WHERE IsRedundantSearch = 0
And if you wanted to globally remove redundant searches, despite which user inputted the search, you can remove the
PARTITION BY clause of the window function like so
IF(LEAD(searchText) OVER (userId ORDER BY id) LIKE CONCAT(searchText, '%'), 1, 0) IsRedundantSearch.
The above will accomplish the output from your given example. Note it specifically only looks at the next 1 row (sorted by the id column) and checks if the that row starts with the same search text as the current row. You can make it do a full contains check instead by adding another wildcard
% symbol to the front of the
searchText in the
CONCAT() operator like so
CONCAT('%', searchText, '%').
LEAD() window function is the key to this solution here, as it can look-ahead at the column inputted by however many rows you specify. The default number of rows it looks-ahead is 1 but if you wanted to look at the 3rd row ahead instead of the next row, then
LEAD(searchText, 3) would look-ahead by 3 rows.