I have a MySQL database containing two tables, “records” and “features”, linked by a foreign key in features that references the primary key “id” in records.
My goal is to create a randomized but fixed subset of the data in the DB for machine learning, one that can be dumped and hashed for reproducibility.
So I thought it would be handy if I could draw a randomized subset of the data and move it to a separate DB.
Is that possible, and if yes: how?
Add another column to the table (BIGINT, for example, depending on the size of the table). Populate it with some function of RAND(), and index the column.
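A minimal sketch of that setup, assuming a table named tbl and a new column named myrand (both names are placeholders):

```sql
-- Add the extra column (BIGINT here; pick a type that fits your needs)
ALTER TABLE tbl ADD COLUMN myrand BIGINT;

-- Populate it with a function of RAND(); here a random integer in [0, 10^18)
UPDATE tbl SET myrand = FLOOR(RAND() * 1000000000000000000);

-- Index the column so range scans on it are cheap
CREATE INDEX idx_myrand ON tbl (myrand);
```

Once populated, the values are fixed, so every query against myrand is reproducible.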
Now you can fetch rows that are consecutive in this new column; they will be “random” from the point of view of the original data. And it will be reproducible.
SELECT * FROM tbl WHERE myrand > $x ORDER BY myrand LIMIT 200000;
In this case I am allowing $x to give you different subsets of the table, each reproducible. You would need to adjust $x to avoid overlap between one subset and another (if that matters).
Note: I did not copy the data over, merely fetched directly from the original data. This may be an added benefit. Or it simply gives you the way to ‘copy’ the rows.
In the meantime I found a solution that at least serves the main goal, but it is not a very direct solution and quite slow, as I am far from proficient with RDBMSs. If anyone can give a better answer that doesn’t involve so many single steps, I will accept that; until then, this is what I ended up doing:
1) Since a backup of a randomized, fixed sample set was the greater goal, it was sufficient to first store the information in separate tables in the same DB, so I created two tables, records_rand and features_rand, with the same structure as the originals. (How to copy entire tables to another DB can be found easily via web search.)
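For step 1, the empty copies can be created directly from the originals (table names taken from the statements used below); note that CREATE TABLE ... LIKE copies columns and indexes, but not foreign key definitions:

```sql
-- Create empty tables with the same column definitions and indexes
CREATE TABLE records_rand LIKE records_agg;
CREATE TABLE features_rand LIKE features_agg;
```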
2) Draw a random sample from the main records table and insert it into the new records_rand table.
INSERT INTO records_rand SELECT * FROM records_agg ORDER BY RAND() LIMIT 200000;
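If the draw itself needs to be repeatable (not just its stored result), RAND() accepts an optional seed; with an unchanged table, the same seed should produce the same ordering and therefore the same sample (42 here is an arbitrary example seed):

```sql
INSERT INTO records_rand
SELECT * FROM records_agg ORDER BY RAND(42) LIMIT 200000;
```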
3) Loop over the newly inserted records and fill the new features table with the associated feature data (this was quite time-consuming: over an hour to copy around 1.8M out of around 10M rows):
DROP PROCEDURE IF EXISTS copyFeatures;
DELIMITER ;;
CREATE PROCEDURE copyFeatures()
BEGIN
    DECLARE currentId INT DEFAULT 0;
    DECLARE n INT DEFAULT 0;
    DECLARE i INT DEFAULT 0;
    SELECT COUNT(*) FROM records_rand INTO n;
    SET i = 0;
    WHILE i < n DO
        -- fetch the id of the i-th sampled record
        -- (ORDER BY makes the offset-based paging deterministic)
        SET currentId = (SELECT id FROM records_rand ORDER BY id LIMIT i, 1);
        -- copy all feature rows belonging to that record
        INSERT INTO features_rand
        SELECT * FROM features_agg WHERE record_id = currentId;
        SET i = i + 1;
    END WHILE;
END ;;
DELIMITER ;

CALL copyFeatures();
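The row-by-row loop above can be replaced by a single set-based statement, which MySQL can usually execute far faster than 200,000 separate inserts; a sketch, assuming features_agg has an index on record_id:

```sql
-- Copy every feature row whose parent record was sampled, in one statement
INSERT INTO features_rand
SELECT f.*
FROM features_agg AS f
JOIN records_rand AS r ON f.record_id = r.id;
```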
( 4) MySQL Workbench allows choosing which tables to include in a dump, so it wasn’t really necessary anymore to move the data to a new DB. )