Question :
I have a database in which I have to store new data fetched from an API on a daily basis. The problem is that I have to truncate the table every day and then insert the newly fetched records into the same table. The record count is around 10k, and it takes 5-10 seconds to insert the new records.
My concern is that if I truncate the table while some users are querying it, they will end up with errors because the table was just truncated. Can you suggest how to tackle this problem? I am not a DBA, so please suggest a better way to do it.
Answer :
Instead, do the following:
CREATE TABLE `new` LIKE `real`;
-- load the new data into `new` by whatever means
RENAME TABLE `real` TO `old`, `new` TO `real`;
DROP TABLE `old`;
Notes:
- The second step (loading the new data) is the only slow step.
- The RENAME is atomic.
- You can continue to read the table during the process.
- Do not use any transactions or table locks.
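To make the swap concrete, here is a minimal sketch of the daily cycle, assuming a hypothetical live table daily_data with made-up columns (id, payload) loaded from your API:
CREATE TABLE daily_data_new LIKE daily_data;

-- Load the freshly fetched API records into the new table.
-- This is the only slow step; readers still see the old daily_data.
INSERT INTO daily_data_new (id, payload)
VALUES (1, 'fresh row 1'), (2, 'fresh row 2');

-- Atomic swap, then discard the stale copy.
RENAME TABLE daily_data TO daily_data_old,
             daily_data_new TO daily_data;
DROP TABLE daily_data_old;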
You can use two tables with the same structure and then use RENAME. RENAME is atomic, so it cannot be interrupted.
RENAME TABLE t1 TO tmp_table,
t2 TO t1,
tmp_table TO t2;
TRUNCATE Table t2;
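As a rough sketch of the daily cycle with this two-table approach (t1 and t2 are the names from the statement above; the columns id and payload are assumptions for illustration):
-- 1. Load the freshly fetched data into the spare table (t2);
--    users keep reading t1 during this slow step.
INSERT INTO t2 (id, payload)
VALUES (1, 'fresh row 1'), (2, 'fresh row 2');

-- 2. Atomic swap: t1 now serves the new data, t2 holds yesterday's rows.
RENAME TABLE t1 TO tmp_table,
             t2 TO t1,
             tmp_table TO t2;

-- 3. Empty t2 so it is ready for tomorrow's load.
TRUNCATE TABLE t2;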
Using MySQL’s Exchanging Partitions and Subpartitions with Tables you can work on a separate, identical table in your own time. When it’s ready, you can then swap it in for a partition in your “real” table. The catch is that you have to partition your table, but only one partition is required. Also, from the reference mentioned (nt here is your table):
Table nt contains no foreign key references, and no other table has any foreign keys that refer to nt.
The example below is close to the ones in the article, but is given here for completeness:
This is the “real” table:
CREATE TABLE stackoverflow (
id INT PRIMARY KEY,
test VARCHAR(30)
);
Create your new table before partitioning your “real” table. That way you don’t have to remove the partitioning from your copy:
CREATE TABLE stackoverflow_workinprogess LIKE stackoverflow;
Create just one partition on your “real” table:
ALTER TABLE stackoverflow PARTITION BY RANGE (id) (
PARTITION p VALUES LESS THAN (MAXVALUE)
);
Inserting some random data for the example:
INSERT INTO stackoverflow VALUES
(1, "Test 1"),
(2, "Test 2"),
(3, "Test 3"),
(4, "Test 4"),
(5, "Test 5");
Now you can work on your data in the stackoverflow_workinprogess table:
TRUNCATE TABLE stackoverflow_workinprogess;
INSERT INTO stackoverflow_workinprogess VALUES
(1, "Test a - new row"),
(2, "Test b - new row"),
(3, "Test c - new row"),
(4, "Test d - new row"),
(6, "Test e - new row"), -- skipping one on purpose
(7, "Test f - new row");
When it’s all done, swap/exchange the partitions:
ALTER TABLE stackoverflow EXCHANGE PARTITION p WITH TABLE stackoverflow_workinprogess;
Result:
select * from stackoverflow;
id | test
----|-----------------
1 | Test a - new row
2 | Test b - new row
3 | Test c - new row
4 | Test d - new row
6 | Test e - new row
7 | Test f - new row
It’s all included in this dbfiddle.uk example.
I didn’t try to play around with creating the partition on the stackoverflow_workinprogess
table, but I’d be surprised if the results were any different.
A table lock might work: getting a table lock would cause other users to wait or time out. However, 5-10 seconds of waiting may be too long.
Alternatively, explore whether a transaction will work. Start a transaction with a BEGIN WORK statement and, when done, issue a COMMIT.
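For completeness, a minimal sketch of the locking variant, again assuming a hypothetical daily_data table; other sessions block on the lock rather than erroring out:
LOCK TABLES daily_data WRITE;   -- readers and writers in other sessions wait here
DELETE FROM daily_data;         -- clear out yesterday's rows
INSERT INTO daily_data (id, payload)
VALUES (1, 'fresh row 1'), (2, 'fresh row 2');
UNLOCK TABLES;                  -- waiting sessions proceed and see only the new data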
It looks like you are truncating the records so that you can refresh the data from an API. However, this is causing problems because some people are using those records.
A better approach would be to use DELETE rather than TRUNCATE, so that you can wrap it in a transaction. Table truncations take effect immediately (in MySQL, TRUNCATE causes an implicit commit and cannot be rolled back), so you will need the (slower) delete. It’d be something like this:
- Start a transaction
- Delete all records
- Insert new records
- Finish the transaction
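A minimal sketch of that sequence, assuming the same hypothetical daily_data table; with InnoDB's default consistent reads, other sessions keep seeing the old rows until the COMMIT:
START TRANSACTION;   -- BEGIN WORK is an equivalent spelling
-- Clear out yesterday's rows; unlike TRUNCATE, this can be rolled back.
DELETE FROM daily_data;
-- Insert the newly fetched records.
INSERT INTO daily_data (id, payload)
VALUES (1, 'fresh row 1'), (2, 'fresh row 2');
COMMIT;              -- readers switch from the old rows to the new ones here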
5-10 seconds to insert 10K new records is probably fine – users that are using the system during this period will just see the old records.
From your description (~10k records), unless the records are yuuuge, I don’t understand why you don’t keep a few days’ worth of data in there (with a date column that’s part of the primary key) and have queries work against that.
The deletion can then be scheduled to run any time later, after you are sure all existing queries against the old data have finished, which usually won't take long.
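A rough sketch of that idea, with a hypothetical daily_data table in which the load date is part of the primary key (all names and columns are made up for illustration):
CREATE TABLE daily_data (
    load_date DATE NOT NULL,
    id        INT NOT NULL,
    payload   VARCHAR(100),
    PRIMARY KEY (load_date, id)
);

-- Daily load: insert today's batch alongside the previous days' rows.
-- Doing this inside one transaction keeps readers from seeing a half-loaded day.
INSERT INTO daily_data (load_date, id, payload)
VALUES (CURRENT_DATE, 1, 'fresh row 1'), (CURRENT_DATE, 2, 'fresh row 2');

-- Queries read only the most recent load.
SELECT * FROM daily_data
WHERE load_date = (SELECT MAX(load_date) FROM daily_data);

-- Purge old days later, once nothing is querying them any more.
DELETE FROM daily_data WHERE load_date < CURRENT_DATE - INTERVAL 3 DAY;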