What is the train_test_split function in Sklearn?
The train_test_split function from the sklearn.model_selection module in Python splits arrays or matrices into random train and test subsets.
To use the train_test_split function, we’ll import it into our program as shown below:
from sklearn.model_selection import train_test_split
Syntax
The syntax of the train_test_split function is as follows:
sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)
Parameter values
The train_test_split function accepts the following parameter values:
- *arrays: These are the arrays or matrices that need to be split. All of them must have the same length.
- test_size: This is the size of the test subset. If this parameter is an int, then it represents the number of values that need to be added to the test subset. If this parameter is a float, then it represents the proportion of the dataset (between 0.0 and 1.0) that needs to be added to the test subset.
- train_size: This is the size of the train subset. Similar to the test_size parameter, the train_size parameter can either be a float or an int.
- random_state: This parameter value controls how the data is shuffled before being split. Passing the same int produces the same split on every call.
- shuffle: This parameter value determines whether or not the data needs to be shuffled before being split.
- stratify: This parameter value takes an array of class labels so that the data is split in a stratified fashion, that is, with the class proportions preserved in both subsets.
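The parameters above can be tried out on a small input. The following is a minimal sketch rather than a definitive recipe: the data and labels lists and the chosen sizes are made up for illustration, and the comments describe what each call is expected to do for this particular input.

```python
from sklearn.model_selection import train_test_split

# Illustrative inputs (not from the original example): 8 values, 2 balanced classes
data = list(range(8))
labels = ["A"] * 4 + ["B"] * 4

# An int test_size is an absolute count of test samples
train, test = train_test_split(data, test_size=2, random_state=0)
print(len(train), len(test))  # 6 2

# The same random_state reproduces the same shuffle, and so the same split
first = train_test_split(data, test_size=0.25, random_state=42)
second = train_test_split(data, test_size=0.25, random_state=42)
print(first == second)  # True

# shuffle=False keeps the original order, so the test subset is the tail
ordered_train, ordered_test = train_test_split(data, test_size=0.25, shuffle=False)
print(ordered_train, ordered_test)  # [0, 1, 2, 3, 4, 5] [6, 7]

# stratify keeps the class proportions the same in both subsets
train_data, test_data, train_labels, test_labels = train_test_split(
    data, labels, test_size=0.5, stratify=labels, random_state=0
)
print(sorted(train_labels), sorted(test_labels))  # ['A', 'A', 'B', 'B'] ['A', 'A', 'B', 'B']
```

Note that stratify requires shuffling, so it cannot be combined with shuffle=False.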
Note: A comprehensive description of the aforementioned parameters can be found in the official scikit-learn documentation for train_test_split.
Return value
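The train_test_split function returns a list that contains the train-test split of each input, in order. Passing n arrays therefore yields 2n subsets: a train subset followed by a test subset for each input.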
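To make the shape of the return value concrete, here is a small sketch. The features, targets, and weights lists are hypothetical inputs chosen only for illustration.

```python
from sklearn.model_selection import train_test_split

# Three hypothetical inputs of equal length
features = [[1, 2], [3, 4], [5, 6], [7, 8]]
targets = [0, 1, 0, 1]
weights = [0.1, 0.2, 0.3, 0.4]

splits = train_test_split(features, targets, weights, test_size=0.5, random_state=0)
print(type(splits).__name__, len(splits))  # list 6

# The outputs come in train/test pairs, in the same order as the inputs
X_train, X_test, y_train, y_test, w_train, w_test = splits

# The same shuffle is applied to every input, so rows stay aligned
for row, target, weight in zip(X_train, y_train, w_train):
    position = features.index(row)
    print(row, target == targets[position], weight == weights[position])
```

Because every input is split with the same indices, each feature row in X_train still lines up with its target in y_train and its weight in w_train.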
Example
The code below shows us how to use the train_test_split function in Python.
from sklearn.model_selection import train_test_split

# declare an array of values
data = [20, 4, 12, 9, 0, 10]

# declare labels associated with each value
labels = ["A", "B", "B", "A", "C", "A"]

# split the data into train-test subsets of equal sizes
train, test = train_test_split(data, test_size=0.5)

print("Splitting into equal parts:")
print("Train Split:", train)
print("Test Split:", test)

# split the dataset into train-test subsets of different sizes
train, test = train_test_split(data, test_size=0.2)

print("\nSplitting into different parts:")
print("Train Split:", train)
print("Test Split:", test)

# split multiple lists
train_data, test_data, train_labels, test_labels = train_test_split(data, labels)

print("\nSplitting multiple lists:")
print("Train Data:", train_data)
print("Test Data:", test_data)
print("Train Labels:", train_labels)
print("Test Labels:", test_labels)
Explanation
- Line 1: We import the train_test_split function from the sklearn.model_selection module.
- Line 4: We initialize a list of values to serve as the data.
- Line 7: We initialize a list of labels that correspond to each value in the data list.
- Line 9: We split the data list into equally sized train and test subsets using the train_test_split function. The lists returned by the function are printed.
- Line 17: We split the data list into differently sized train and test subsets using the train_test_split function, with test_size set to 20% of the values. The lists returned by the function are printed.
- Line 24: We split both the data and labels lists in a single call to get all the train and test subsets. Since neither test_size nor train_size is given, the default test proportion of 25% applies. The lists returned by the function are printed.
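With only six values, the requested proportions cannot always be met exactly. The sketch below prints the subset sizes for the three settings used in the example. The sizes dictionary is a helper introduced here for illustration, and the rounding described in the comments (the test count appears to be rounded up) reflects the behavior of the scikit-learn versions assumed here rather than a guarantee from the text above.

```python
from sklearn.model_selection import train_test_split

data = [20, 4, 12, 9, 0, 10]

# Map each test_size setting to the resulting (train, test) subset sizes
sizes = {}
for test_size in (0.5, 0.2, None):
    train, test = train_test_split(data, test_size=test_size, random_state=0)
    sizes[test_size] = (len(train), len(test))
    print(test_size, sizes[test_size])

# Assumed behavior: a float test_size is converted to a count by rounding up,
# so 0.2 * 6 = 1.2 becomes 2 test values, and the default of 0.25 * 6 = 1.5
# also becomes 2 test values.
```

The values that land in each subset change from run to run unless random_state is fixed, but the subset sizes do not.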