MySQL query time massively affected by LIMIT

Question :

I have this MySQL query, where I get a 250 ms query time when the limit is 500.

If the limit is 1000, it’s 1.5 seconds.

If the limit is 1300, it’s 2.5 seconds.

select hotel_code,
       st_distance_sphere(
           point(longitude, latitude),
           point(-74.007227, 40.713054)
       ) * 0.000621371192 as distance
from hotel_parser_links hpl
where parser_code = 'ST'
having distance is not null
   and distance < 30
limit 500

Here is the table structure, it has around 500k rows

create table hotel_parser_links
(
    id bigint unsigned auto_increment
        primary key,
    hotel_id bigint unsigned not null,
    parser_code varchar(255) not null,
    hotel_code varchar(255) not null,
    finished_parsing_at datetime null,
    started_parsing_at datetime null,
    city varchar(255) null,
    longitude decimal(11,8) null,
    latitude decimal(11,8) null,
    coordinate point not null
);

create spatial index hotel_parser_links_coordinate_index
    on hotel_parser_links (coordinate);

create index hotel_parser_links_hotel_code_index
    on hotel_parser_links (hotel_code);

create index hotel_parser_links_latitude_index
    on hotel_parser_links (latitude);

create index hotel_parser_links_longitude_index
    on hotel_parser_links (longitude);

create index hotel_parser_links_parser_code_index
    on hotel_parser_links (parser_code);

Here is the explain output

| id | select_type | table | partitions | type | possible_keys                        | key                                  | key_len | ref   | rows   | filtered | Extra |
|  1 | SIMPLE      | hpl   | NULL       | ref  | hotel_parser_links_parser_code_index | hotel_parser_links_parser_code_index | 1022    | const | 270853 |   100.00 | NULL  |

Would appreciate any help 🙂

Answer :

The bigger the limit, the more of the 500K rows it will have to scan. Hence the longer the query will take.

Yes, I see that the growth is not “linear”; there are many possible reasons for that, but let’s ignore that and go for what you really want — the query to be faster.

See the following for several ways to speed up the query. In particular, pick the SPATIAL option.
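As a sketch of the SPATIAL option (hypothetical: it assumes the `coordinate` column holds SRID-0 points stored as POINT(longitude, latitude), and a MySQL version with ST_Distance_Sphere and ST_MakeEnvelope, i.e. 5.7+):

```sql
-- ~30 miles is about 0.44° of latitude; longitude degrees shrink by cos(latitude).
SET @dlat := 30 / 69.0;
SET @dlng := 30 / (69.0 * COS(RADIANS(40.713054)));

SELECT hotel_code,
       ST_Distance_Sphere(coordinate, POINT(-74.007227, 40.713054))
           * 0.000621371192 AS distance
FROM   hotel_parser_links
WHERE  parser_code = 'ST'
  AND  MBRContains(
           ST_MakeEnvelope(POINT(-74.007227 - @dlng, 40.713054 - @dlat),
                           POINT(-74.007227 + @dlng, 40.713054 + @dlat)),
           coordinate)              -- rough box; can use the SPATIAL index
HAVING distance < 30;               -- exact distance filter on the survivors
```

The box is deliberately a little generous; the HAVING clause then discards the corner cases the rectangle lets through.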

Your single-column indexes on latitude and longitude are not useful for this query, and they would still not be useful even if you added a “bounding box” constraint to the WHERE clause. The composite indexes INDEX(lat, lng) and INDEX(lng, lat), however, are useful for a bounding box.

But the real problem seems to be that the Spatial test is in HAVING. The Spatial index can be used if you move that test into the WHERE clause. Yes, you will have to repeat the expression; too bad.
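Concretely, that rewrite might look like this (a sketch, assuming the distance expression in your query is ST_Distance_Sphere, which the fragment above seems to have lost):

```sql
-- Same filter, repeated in WHERE instead of HAVING.
SELECT hotel_code,
       ST_Distance_Sphere(point(longitude, latitude),
                          point(-74.007227, 40.713054))
           * 0.000621371192 AS distance
FROM   hotel_parser_links
WHERE  parser_code = 'ST'
  AND  ST_Distance_Sphere(point(longitude, latitude),
                          point(-74.007227, 40.713054))
           * 0.000621371192 < 30;
```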

PS: distance is not null is redundant; the test < 30 fails for NULL.

Since there is no test on lat or lng, INDEX(parser_code, lat, lng) won’t help. You could add a bounding box test; that might let that composite index be even faster than SPATIAL; that would depend on the cardinality of parser_code.
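A hedged sketch of that bounding-box variant (the index name, and the degree deltas of roughly 0.44° of latitude and 0.57° of longitude for 30 miles at this latitude, are mine):

```sql
ALTER TABLE hotel_parser_links
    ADD INDEX parser_lat_lng (parser_code, latitude, longitude);

SELECT hotel_code,
       ST_Distance_Sphere(point(longitude, latitude),
                          point(-74.007227, 40.713054))
           * 0.000621371192 AS distance
FROM   hotel_parser_links
WHERE  parser_code = 'ST'
  AND  latitude  BETWEEN 40.713054 - 0.44 AND 40.713054 + 0.44
  AND  longitude BETWEEN -74.007227 - 0.57 AND -74.007227 + 0.57
HAVING distance < 30;   -- exact check on the rows the box lets through
```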

The explain seems to show it’s using the index on parser_code, but parser_code ‘ST’ matches more than half of the records, so it is not what you’d call “selective”.
Either drop that index, or hide the column in a function so the optimizer cannot use it. Also, I’d replace the HAVING with a WHERE.

with H as (
    select hotel_code,
           st_distance_sphere(
               point(longitude, latitude),
               point(-74.007227, 40.713054)
           ) * 0.000621371192 as distance
    from hotel_parser_links hpl
    where concat(parser_code, '') = 'ST'
)
select * from H
where distance < 30;

What’s the difference between a full table scan (FTS) and an index scan (IS)?

When selecting data via an index, the DB finds the first index entry matching your criteria, locates the table block containing the corresponding record, finds the record within that block, and runs the remaining checks to see whether all of your criteria are satisfied; if so, it returns the record. Then it reads the next index entry, may have to fetch another block from the table, finds the record in that block, and so on.

When physical disks are involved, this can be dreadfully slow, unless the DB or the disk has a memory cache. They all do, nowadays. But even with SSDs, you can see there’s a lot of processing involved.

Compare that to a sequential, full table scan, where the DB reads block after block, record after record: no searching or positioning, only criteria validation, which had to be done with the index scan as well. You could say the DB has to run that validation for all records instead of for a portion, but in your case it only has to compare the parser_code before it can discard 230K records. A laughable effort compared to the index logic.
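If you want to see both access paths side by side, MySQL’s index hints make the experiment easy (a sketch; plans and timings will vary with your data):

```sql
-- Force the index scan the optimizer chose:
EXPLAIN SELECT COUNT(*) FROM hotel_parser_links
        FORCE INDEX (hotel_parser_links_parser_code_index)
        WHERE parser_code = 'ST';

-- Force the full table scan for comparison:
EXPLAIN SELECT COUNT(*) FROM hotel_parser_links
        IGNORE INDEX (hotel_parser_links_parser_code_index)
        WHERE parser_code = 'ST';
```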
