Question :
I have a query as part of my process, that I execute hundreds of times in a loop.
Initially, Table A contains all records (20mil). Table B contains 0 records.
Primary key in both tables ID
the I execute:
select * from table A where a.ID not in (select ID from table b) limit 10000
##magic stuff in python
insert everything to table B, once again, .
Initially, the query runs super fast, but after Nth loop (100th+), size of table B increases, to the point where I it takes a bit of time to perform the NOT IN operation.
Does anyone have recommendations on how I can speed up the query?
– So far, I’ve tweaked the default mysql bugger to be 1.5gbs (ids are pretty small INTs, so that should be enough).
Caveats:
1) One way to do this would be to remove * from table A after I’ve processed them. However, I want to keep table A in tact.
… only method I could think of I adding another column to table A (which I’d index) called ‘PROCESSED’… then update that column with a second query once the records have been processed/posted, but I was hoping there was an easier solution.
Thank you all in advance.
Answer :
If your goal is to look at every row in A
and do something with it, there is a much more efficient way. (It seems that B
is merely there to see what you have already processed.)
The reason for it getting slower is that it has to do more work as it gets farther into A
— namely skipping over the rows it has processed. A processed
flag might suffer the same malady.
So…
Walk through A
processing chunks as you go. Then remember where you left off so that the next 10000 will be right there, no searching. I discuss that in more detail with an eye to DELETEing
, but it can be adapted for other purposes: http://mysql.rjweb.org/doc.php/deletebig#deleting_in_chunks With that, B
is unnecessary.
Partitioning
If you are thinking about partitioning the data by months, I have to ask you “Why?”. Here are some answers:
- For performance? You won’t get any.
- For rapid deletion of “old” data? This is a good use case, but be sure to use
PARTITION BY RANGE(...)
and include the year, too. More: http://mysql.rjweb.org/doc.php/partitionmaint