How to query a table efficiently without affecting live transactions

Question :

I have a high transaction table with millions of records with the structure below:

transaction
--------------
id             int
txn_status     varchar
amount         bigint
name           varchar
txn_time       datetime --Date and time of the transaction

The txn_status field value can be Completed or Pending. I need to run a query that retrieves the number of records with the following characteristics:

  1. Transactions with Pending txn_status within the last 15 minutes, i.e. (current_time - txn_time) <= 15 minutes
  2. Transactions with Pending txn_status between 16 and 25 minutes ago, i.e. (current_time - txn_time) >= 16 minutes and (current_time - txn_time) <= 25 minutes
  3. Transactions with Pending txn_status between 26 and 30 minutes ago, i.e. (current_time - txn_time) >= 26 minutes and (current_time - txn_time) <= 30 minutes
  4. Transactions with Pending txn_status more than 30 minutes ago, i.e. (current_time - txn_time) > 30 minutes
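For reference, the four buckets above can be counted in a single pass with CASE expressions inside COUNT. This is only a sketch against the table as described, assuming MySQL, contiguous boundaries at 15/25/30 minutes, and an index on (txn_status, txn_time):

```sql
-- Single-pass bucket counts over pending transactions (sketch).
SELECT
    COUNT(CASE WHEN txn_time >= NOW() - INTERVAL 15 MINUTE THEN 1 END) AS last_15_min,
    COUNT(CASE WHEN txn_time <  NOW() - INTERVAL 15 MINUTE
               AND txn_time >= NOW() - INTERVAL 25 MINUTE THEN 1 END) AS min_16_to_25,
    COUNT(CASE WHEN txn_time <  NOW() - INTERVAL 25 MINUTE
               AND txn_time >= NOW() - INTERVAL 30 MINUTE THEN 1 END) AS min_26_to_30,
    COUNT(CASE WHEN txn_time <  NOW() - INTERVAL 30 MINUTE THEN 1 END) AS over_30_min
FROM transaction
WHERE txn_status = 'Pending';
```

Note that, as discussed in the answer below, separate range queries may cost less than one CASE scan.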

Two solutions currently come to mind:

  1. Query the production database every minute using a CASE statement (the CASE statement will filter and group the records based on the time differences above) in the WHERE clause of the query.

  2. To avoid putting too much load on the production database, maintain a separate database with a similar table, keep it current with INSERT and UPDATE triggers, and then run the query from No. 1 above against that database.

If there is a more optimal solution that avoids putting much load on the database, please share it.

Answer :

Typically, you don’t want to be querying tables with high transaction volume at all; that’s what a data warehouse is for.

But if you must, you can use SNAPSHOT ISOLATION, which will query the table as of the last committed transaction before it started. The result may be slightly stale by the time the query finishes running (this is not a dirty read; you only ever see committed data), but as you're looking at activity 15-30 minutes in the past it should be OK.

In MySQL (assuming you're using InnoDB), you can use START TRANSACTION WITH CONSISTENT SNAPSHOT;, as there is no global setting for SNAPSHOT ISOLATION.

You can also look here for more info: https://dev.mysql.com/doc/refman/5.7/en/innodb-consistent-read.html
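A minimal sketch of what that looks like in a MySQL session (assuming InnoDB and the default REPEATABLE READ isolation level, under which the snapshot semantics apply):

```sql
-- Read from a stable snapshot without blocking concurrent writers.
START TRANSACTION WITH CONSISTENT SNAPSHOT;

SELECT COUNT(*) FROM transaction WHERE txn_status = 'Pending';
-- Any further SELECTs here see the same point-in-time snapshot.

COMMIT;
```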

First, some schema points:

  1. txn_status should be a tinyint with a foreign key and an index.
  2. Do you really need bigint for amount?
  3. I trust there is an index on txn_time.

Then consider four different queries rather than one CASE statement. In the three test queries below, the first (CASE) query was 86% of the total cost: the CASE version was an index scan, while the WHERE versions were index seeks. Assign the current time to a variable so the reference time is the same for all four queries. The older transactions are going to sit in pages that inserts most likely do not touch, so you would only get contention on the newest bucket; make that query short and sweet. (This was tested on MS SQL Server.)
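As a sketch of the schema changes suggested above (table and index names are illustrative, assuming MySQL):

```sql
-- Hypothetical lookup table replacing the varchar status column.
CREATE TABLE txn_status_ref (
    status_id   TINYINT UNSIGNED PRIMARY KEY,
    status_name VARCHAR(20) NOT NULL
);

ALTER TABLE transaction
    ADD COLUMN status_id TINYINT UNSIGNED,
    ADD CONSTRAINT fk_txn_status
        FOREIGN KEY (status_id) REFERENCES txn_status_ref (status_id);

-- Composite index so the pending/time-range queries can seek.
CREATE INDEX idx_status_time ON transaction (status_id, txn_time);
```

A tinyint status also shrinks the row and the index compared to repeating a varchar value millions of times.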

select count(CASE WHEN sID <= 1000 THEN 1 END) as '1000'
     , count(CASE WHEN sID >  1000 and sID <= 2000 THEN 1 END) as '2000'
from docSVsys 

select count(*) as ttl 
from docSVsys 
where sID <= 1000

select count(*) as ttl 
from docSVsys 
where sID > 1000 and sID <= 2000

Do not hide indexed columns inside expressions such as (current_time - txn_time) <= 15 minutes; that makes the predicate non-sargable, so the index on txn_time cannot be used for a range seek. Instead:

WHERE txn_time >= NOW() - INTERVAL 15 MINUTE

(This assumes you have an index on txn_time.)

“between the last 26 and 30 minutes” needs

WHERE txn_time >= NOW() - INTERVAL 30 MINUTE
  AND txn_time  < NOW() - INTERVAL 26 MINUTE

(etc)
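Putting the pieces together, here is a sketch of the four counts as separate sargable queries, with the current time captured once in a user variable so all buckets share the same reference point. This assumes MySQL and uses contiguous boundaries at 15/25/30 minutes; adjust the edges if you truly want the gaps implied by the minute-based wording in the question:

```sql
SET @now := NOW();  -- one timestamp shared by all four buckets

SELECT COUNT(*) FROM transaction
WHERE txn_status = 'Pending'
  AND txn_time >= @now - INTERVAL 15 MINUTE;

SELECT COUNT(*) FROM transaction
WHERE txn_status = 'Pending'
  AND txn_time >= @now - INTERVAL 25 MINUTE
  AND txn_time <  @now - INTERVAL 15 MINUTE;

SELECT COUNT(*) FROM transaction
WHERE txn_status = 'Pending'
  AND txn_time >= @now - INTERVAL 30 MINUTE
  AND txn_time <  @now - INTERVAL 25 MINUTE;

SELECT COUNT(*) FROM transaction
WHERE txn_status = 'Pending'
  AND txn_time <  @now - INTERVAL 30 MINUTE;
```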

If you want to discuss further please provide SHOW CREATE TABLE and EXPLAIN SELECT ....
