How to do a fast simple query on a large table of postcodes

Question :

I have a table with 2.5 million postcodes.

I have to populate a textbox while the user is typing a postcode, so the query has to be very fast.

The table consists of:

 ID (Primary key), Postcode (Unique), Population INT(11) NOT NULL
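For reference, such a table might look roughly like this (a sketch; only Population's type was stated, so the types of ID and Postcode are assumptions):

CREATE TABLE postcodes (
    ID INT NOT NULL AUTO_INCREMENT,    -- assumed type
    Postcode VARCHAR(10) NOT NULL,     -- assumed length
    Population INT(11) NOT NULL,
    PRIMARY KEY (ID),
    UNIQUE KEY (Postcode)
);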

The query is:

SELECT * FROM `postcodes` WHERE substr(Postcode,1,LENGTH_OF_TERM) ='TERM'  order by population desc limit 0,10

TERM is what the user has typed and LENGTH_OF_TERM is the length of TERM.

Even though I have also created an index on Postcode, the query is very slow (3-4 seconds).

How can I improve the speed of the query?

Answer :

Adding a combined (postcode, population) index might help with the ORDER BY part of the SQL.

ALTER TABLE postcodes ADD KEY postcode_population (postcode, population);

The main problem is that running a function over a column means MySQL is unable to use an index on that column. Here is an equivalent query that applies no function to postcode:

SELECT * FROM `postcodes` WHERE Postcode LIKE 'TERM%' order by population desc limit 0,10

You can see which index is used with EXPLAIN {query}. MySQL generally uses only one index per table in a query.
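For example (a sketch, assuming the postcode_population index from the ALTER TABLE above has been created):

EXPLAIN SELECT * FROM `postcodes`
    WHERE Postcode LIKE 'TERM%'
    ORDER BY population DESC
    LIMIT 0, 10;
-- In the output, `key` should show postcode_population and `type`
-- should show `range` rather than `ALL` (a full table scan).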

What you're asking for can be done. However, I would recommend reading the list of postal codes into a collection (in Java, an ArrayList) in the background as your app starts. Design a Textbox or ComboBox extension to work with the list, narrowing the displayed list of codes as the user enters characters. You may want to ask around on other SE sites to see if someone has already built such a thing in the platform objects you use.

You can perform a huge amount of in-memory calculation and manipulation in the time it takes to perform one query, no matter how well the query has been tuned. I'm talking microseconds vs milliseconds. That's three orders of magnitude. A supersonic jet flies only two orders of magnitude faster than a man runs.

You’re simply not going to get the response you want trying to perform I/O between keystrokes.

danblack is on the right track, but it will be much too slow when TERM is only one character. This is because the composite index, though “covering”, can use only the first column (postcode) since it is a ‘range’ (LIKE), not a simple =.

Here’s a workaround…

When LENGTH(TERM) = 1, fetch the 10 rows from a specially built table:

CREATE TABLE table1 (
    prefix CHAR(1) NOT NULL,
    population FLOAT NOT NULL,
    city VARCHAR(...) NOT NULL,
    PRIMARY KEY (prefix, population)
);

Put only the top 10 cities for each initial character. Fetch this way:

SELECT city FROM table1
    WHERE prefix = $term   -- no LIKE, etc needed
    ORDER BY population DESC;
    -- No need for `LIMIT` since it was populated with only 10 each.
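One way to load only the top 10 per initial character is a window function (a sketch, assuming MySQL 8.0+ for ROW_NUMBER and a source table `census` with term, city, and population columns, as used in the INSERT statements further down):

INSERT INTO table1 (prefix, population, city)
SELECT prefix, population, city
FROM (
    SELECT LEFT(term, 1) AS prefix,
           population,
           city,
           ROW_NUMBER() OVER (PARTITION BY LEFT(term, 1)
                              ORDER BY population DESC) AS rn
    FROM census
) AS ranked
WHERE rn <= 10;   -- keep only the 10 most populous per prefix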

Then check to see if 2-character prefixes run fast enough. If too slow, then build a table for that case.

For that matter, since all the data is relatively static, build a single table with at most 10 rows for each case of each length. Better yet, 1 row:

CREATE TABLE table_n (
    prefix VARCHAR(10) CHARACTER SET ascii COLLATE ascii_general_ci NOT NULL,
    -- population no longer needed
    cities VARCHAR(333) NOT NULL,  -- pre-built comma-separated list of up to 10 cities
    PRIMARY KEY (prefix)  -- population not needed any more
);

SELECT cities FROM table_n
    WHERE prefix = $term
    -- ORDER BY no longer needed

To populate the table (after the next census) takes 10 similar steps:

INSERT INTO table_n
    SELECT LEFT(term, 1) AS prefix,
           GROUP_CONCAT(city ORDER BY population DESC) AS cities
        FROM census
        GROUP BY prefix;
INSERT INTO table_n
    SELECT LEFT(term, 2) AS prefix,
        ...
INSERT INTO table_n
    SELECT LEFT(term, 3) AS prefix,
        ...
...

But there is a problem: we need a LIMIT 10 inside the GROUP_CONCAT, and the syntax does not allow it. This can be resolved with something like

SUBSTRING_INDEX(GROUP_CONCAT(...), ',', 10) AS cities
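Combined, the first step might look like this (a sketch, using the same census and table_n names as above):

INSERT INTO table_n
    SELECT LEFT(term, 1) AS prefix,
           SUBSTRING_INDEX(GROUP_CONCAT(city ORDER BY population DESC),
                           ',', 10) AS cities   -- keep only the first 10
        FROM census
        GROUP BY prefix;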

SELECT … WHERE substr(Postcode,1,LENGTH_OF_TERM) = 'TERM'

Even though I have also created an index on Postcode, the query is very slow (3-4 seconds).

The way the query is put together prevents the database from using an index on Postcode.

You've asked it to locate records based on the result of a function applied to the value of the Postcode field. The only way it can work out the result of that calculation for each row is to call the function for each and every row, scanning sequentially through the table (a.k.a. a "Table Scan", a.k.a. "Slow").

Instead, try the LIKE operator:

SELECT ...
WHERE Postcode LIKE CONCAT('TERM', '%')  -- in MySQL, concatenate with CONCAT() rather than ||

This will locate any postcode starting with the given term and allows the database to use an Index on Postcode (because it can Range Scan items in the Index that start with the given fragment).
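Putting that together with the ORDER BY and LIMIT from the question, the full query might look like this (a sketch using a bound parameter for the user's term):

SELECT Postcode, Population
    FROM postcodes
    WHERE Postcode LIKE CONCAT(?, '%')   -- ? is bound to the user's term
    ORDER BY Population DESC
    LIMIT 10;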
