Challenge. Split database 500 records in two smaller datasets, randomly

Posted on

Question :

I have a database with 500 records. I want to split these records to 75% and 25% *randomly*in order to use the different datasets for training and testing to machine learning algorithms. Does anyone knows how to do that using an sql query?
ps. I use mysql.

In addition I find this …
select …
from …
where …
ORDER BY rand() LIMIT 150 but this gives me only the first dataset

Thanks in advance

Answer :

Easiest solution would be to store the datasets into tables.
After you use your query to get the first dataset:

 INSERT INTO dataset1 SELECT * FROM foo ORDER BY rand() LIMIT 125

then do:

 INSERT INTO dataset2 SELECT foo.* FROM foo 
   LEFT JOIN dataset1 d1 ON foo.key=d1.key WHERE d1.key IS NULL

Your queries are going somewhere. Therefore you can get all the data in one go and tag them differently, e.g. (SQL Fiddle)

select if(rownum>=@rn*0.75,1,2) part, q3.*
from (
  select @rn:=@rn+1 rownum, q2.*
  from (select @rn:=0) q1,
       (select * from random500 order by rand()) q2
) q3
order by part, something;

If the data never leaves the server, you can create a temporary table with the additional part column, e.g.

create temporary table tmp as
select if(rownum>=@rn*0.75,1,2) part, q3.*
from (
  select @rn:=@rn+1 rownum, q2.*
  from (select @rn:=0) q1,
       (select * from random500 order by rand()) q2
) q3
order by part, something;

And use it within the same session, e.g. select ... from tmp where part = 1

Otherwise, you can create it as a normal table and dispose of it later.

Leave a Reply

Your email address will not be published. Required fields are marked *