Three ways of splitting Train and Test in RStudio



I have put together three ways of splitting a dataset into a training set and a test set before running the model.

    1. ifelse

This approach adds a new column named ‘train’ to the dataset. The runif() function generates random values from a uniform distribution, by default between 0 (minimum) and 1 (maximum). Here it generates one value per row of the dataset. If a generated value is smaller than 0.8, the ifelse() function assigns 1 to the respective row; otherwise it assigns 0. So roughly 80% of the rows get the value 1 and roughly 20% get the value 0. The proportions are approximate because each row is assigned at random. Beautiful!

The training set (trainset) is then built from the rows where ‘train’ equals 1, and the test set (testset) from the rows where ‘train’ equals 0.

The ‘train’ column has to be removed before running the prediction model, as it is needed only for separating the data. The grep() function finds the index of the ‘train’ column, and that column is then dropped from both trainset and testset.
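The steps above can be sketched as follows. This is a minimal example rather than the original code: the built-in iris data frame stands in for the dataset, the seed is arbitrary, and the variable name trainColNum is an assumption made for illustration.

```r
# Assumption: the built-in iris data frame stands in for "the dataset"
set.seed(42)  # arbitrary seed, only so the split is reproducible
dataset <- iris

# One uniform random value per row; 1 marks a training row, 0 a test row
dataset$train <- ifelse(runif(nrow(dataset)) < 0.8, 1, 0)

# Separate the rows by the value of the helper column
trainset <- dataset[dataset$train == 1, ]
testset  <- dataset[dataset$train == 0, ]

# Find the index of the 'train' column (exact name match) and drop it
trainColNum <- grep("^train$", names(trainset))
trainset <- trainset[, -trainColNum]
testset  <- testset[, -trainColNum]
```

The anchored pattern "^train$" avoids accidentally matching other columns whose names merely contain the word "train".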



 

    2. createDataPartition

To split the data into a training and a test set we can use the createDataPartition() function from the 'caret' library. Its first argument is the outcome variable, which it uses to keep the class proportions similar in both sets. The argument p = 0.8 sets the share of the data that goes to training, and list = FALSE means the selected row indices are returned as a matrix rather than as a list.
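A minimal sketch of this approach, assuming the caret package is installed. The iris data frame and its Species column are stand-ins chosen for illustration, not the data used in the original post.

```r
library(caret)  # assumption: caret is installed

set.seed(42)  # arbitrary seed for reproducibility
dataset <- iris  # assumption: illustrative dataset

# Row indices for the training set, stratified by the outcome variable;
# list = FALSE returns a one-column matrix of indices instead of a list
trainIndex <- createDataPartition(dataset$Species, p = 0.8, list = FALSE)

# Rows in trainIndex go to training, all remaining rows go to testing
trainset <- dataset[trainIndex, ]
testset  <- dataset[-trainIndex, ]
```

Unlike the ifelse() approach, no helper column is added, so there is nothing to remove afterwards.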


    3. sample


The sample() function is part of base R, so no extra library is needed. It draws random row numbers, without replacement, from the range 1 to the number of rows of the dataset. The number of values drawn is set to 80% of the number of rows; these become the training rows, and the remaining rows form the test set.
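A minimal sketch of the sample() approach. The iris data frame and the names trainSize and trainIndex are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed for reproducibility
dataset <- iris  # assumption: illustrative dataset

# Number of training rows: 80% of all rows, rounded down
trainSize <- floor(0.8 * nrow(dataset))

# Draw that many distinct row numbers at random (no replacement by default)
trainIndex <- sample(seq_len(nrow(dataset)), size = trainSize)

# Selected rows form the training set, the rest form the test set
trainset <- dataset[trainIndex, ]
testset  <- dataset[-trainIndex, ]
```

Because the number of rows is fixed in advance, this gives an exact 80/20 split, whereas the ifelse() approach gives only an approximate one.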


Drop me an email if you have any questions 📧





