Question :
I have a database with 500 records. I want to split these records to 75% and 25% *randomly*in order to use the different datasets for training and testing to machine learning algorithms. Does anyone knows how to do that using an sql query?
ps. I use mysql.
In addition I find this …
select …
from …
where …
ORDER BY rand() LIMIT 150 but this gives me only the first dataset
Thanks in advance
Answer :
Easiest solution would be to store the datasets into tables.
After you use your query to get the first dataset:
INSERT INTO dataset1 SELECT * FROM foo ORDER BY rand() LIMIT 125
then do:
INSERT INTO dataset2 SELECT foo.* FROM foo
LEFT JOIN dataset1 d1 ON foo.key=d1.key WHERE d1.key IS NULL
Your queries are going somewhere. Therefore you can get all the data in one go and tag them differently, e.g. (SQL Fiddle)
select if(rownum>=@rn*0.75,1,2) part, q3.*
from (
select @rn:=@rn+1 rownum, q2.*
from (select @rn:=0) q1,
(select * from random500 order by rand()) q2
) q3
order by part, something;
If the data never leaves the server, you can create a temporary table with the additional part
column, e.g.
create temporary table tmp as
select if(rownum>=@rn*0.75,1,2) part, q3.*
from (
select @rn:=@rn+1 rownum, q2.*
from (select @rn:=0) q1,
(select * from random500 order by rand()) q2
) q3
order by part, something;
And use it within the same session, e.g. select ... from tmp where part = 1
Otherwise, you can create it as a normal table and dispose of it later.