How to shuffle DataFrame rows
We have multiple options to shuffle pandas DataFrame rows in Python. In short, the pandas, scikit-learn, and NumPy libraries provide methods that we can use to shuffle rows in our dataset.
pandas
The DataFrame.sample() method returns a random sample of rows. The frac argument sets the fraction of rows to return, so frac=1 returns every row in random order, which shuffles the dataset. The method also accepts the random_state parameter, which makes the results reproducible.
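As a minimal sketch of how frac and random_state interact, the snippet below shuffles a small made-up DataFrame twice with the same seed. The column names and values are illustrative assumptions, not part of the Iris example used later.

```python
import pandas as pd

# Small illustrative frame (hypothetical data, not from the article)
df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "score": [10, 20, 30, 40, 50]})

# frac=1 returns every row in random order; random_state fixes that order
shuffled_a = df.sample(frac=1, random_state=42)
shuffled_b = df.sample(frac=1, random_state=42)

print(shuffled_a)
print(shuffled_a.equals(shuffled_b))  # True: same seed, same order
```

Leaving out random_state gives a different order on each call, which is usually what we want outside of tests and tutorials.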
scikit-learn
The shuffle() function in the sklearn.utils module returns a random permutation of the rows of the dataset we provide as input. It also accepts the random_state parameter, which makes the results reproducible.
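The following sketch shows the same idea with scikit-learn, again on a small made-up DataFrame (the data is an assumption for illustration). It also checks that the input DataFrame is left untouched, since shuffle() returns a shuffled copy.

```python
import pandas as pd
from sklearn.utils import shuffle

# Small illustrative frame (hypothetical data, not from the article)
df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "score": [10, 20, 30, 40, 50]})

# random_state makes the permutation repeatable across calls
shuffled = shuffle(df, random_state=0)
repeat = shuffle(df, random_state=0)

print(shuffled)
print(shuffled.equals(repeat))  # True: same seed, same order
print(df["id"].tolist())        # the original frame keeps its order
```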
NumPy
We can use np.random.permutation() to produce an array that contains the integers from 0 up to the length of the DataFrame minus one, in random order. We can then pass this array to the DataFrame.iloc indexer, which selects rows by integer position, to shuffle the dataset.
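To make the two steps visible, the sketch below prints the permuted positions before using them to reorder a small made-up DataFrame. Note the assumption: it uses a seeded np.random.default_rng() generator instead of np.random.permutation() so that the run is repeatable; the generator's permutation() method serves the same purpose here.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical data, not from the article)
df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "score": [10, 20, 30, 40, 50]})

# Seeded generator: an assumption made here for reproducibility
rng = np.random.default_rng(seed=0)

# Step 1: the integers 0..len(df)-1 in random order
order = rng.permutation(len(df))
print(order)

# Step 2: select the rows at those integer positions
shuffled = df.iloc[order]
print(shuffled)
```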
Note: After shuffling, each row keeps its original index label. We can call DataFrame.reset_index(drop=True) so that the rows receive new consecutive indices and the old ones are dropped.
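The effect of the drop argument is easiest to see side by side. This is a small sketch on made-up data (the "value" column is an illustrative assumption).

```python
import pandas as pd

# Small illustrative frame (hypothetical data, not from the article)
df = pd.DataFrame({"value": ["a", "b", "c", "d"]})

shuffled = df.sample(frac=1, random_state=1)
print(shuffled.index.tolist())   # original labels, now in shuffled order

# Without drop=True, the old labels are kept in a new "index" column
kept = shuffled.reset_index()
print(kept.columns.tolist())     # ['index', 'value']

# With drop=True, the old labels are discarded
dropped = shuffled.reset_index(drop=True)
print(dropped.index.tolist())    # [0, 1, 2, 3]
```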
Example
from sklearn.datasets import load_iris
from sklearn.utils import shuffle
import pandas as pd
import numpy as np
# Load a dataset
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
print("Original Dataset:")
print(df.head())
# Using pandas
df1 = df.sample(frac=1).reset_index(drop=True)
print("Dataset shuffled using pandas:")
print(df1.head())
# Using scikit-learn
df2 = shuffle(df)
print("Dataset shuffled using Sklearn:")
print(df2.head())
# Using NumPy
df3 = df.iloc[np.random.permutation(len(df))].reset_index(drop=True)
print("Dataset shuffled using NumPy:")
print(df3.head())
Code
Lines 1–4: We import the relevant libraries.
Lines 6–7: We load the Iris dataset from the sklearn library and wrap it in a DataFrame.
Line 11: We use pandas to shuffle the dataset.
Line 15: We use the sklearn.utils.shuffle() function to shuffle the dataset.
Line 19: We use the NumPy library to generate a randomly ordered array of row positions, which we then pass to the DataFrame.iloc indexer to select the rows.
Execution
Let's compare the speed of these three methods. In each example, we use the time library to measure the total running time of 10,000 runs of the shuffle and report the average in ms/run. Execution time matters in practice when we want to write code that scales to large datasets.
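The unit conversion behind ms/run is: total seconds × 1000 ÷ number of runs. With 10,000 runs, this reduces to total seconds ÷ 10. The helper below is a minimal sketch of that arithmetic. The name ms_per_run, the use of time.perf_counter(), and the synthetic 150 × 4 DataFrame are assumptions made for illustration.

```python
import time

import numpy as np
import pandas as pd


def ms_per_run(func, runs=1000):
    """Return the average wall-clock time of func() in milliseconds."""
    start = time.perf_counter()
    for _ in range(runs):
        func()
    elapsed_s = time.perf_counter() - start
    # seconds -> milliseconds, then average over the number of runs
    return elapsed_s * 1000 / runs


# Synthetic stand-in with the same shape as the Iris features (150 x 4)
df = pd.DataFrame(np.arange(600).reshape(150, 4), columns=list("abcd"))

pandas_ms = ms_per_run(lambda: df.sample(frac=1).reset_index(drop=True))
numpy_ms = ms_per_run(
    lambda: df.iloc[np.random.permutation(len(df))].reset_index(drop=True)
)

print(f"pandas: {pandas_ms:.4f} ms/run")
print(f"NumPy:  {numpy_ms:.4f} ms/run")
```

Measured values depend on the machine and the library versions, so they are best read as relative comparisons.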
pandas
In pandas, we have the built-in sample() method that we can use to shuffle the data.
import time
from sklearn.datasets import load_iris
import pandas as pd
# Load a dataset
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
tik = time.time()
for _ in range(10000):
    df = df.sample(frac=1).reset_index(drop=True)
tok = time.time()
# Total seconds for 10,000 runs divided by 10 gives milliseconds per run
print("Time taken: " + str((tok - tik) / 10) + "ms")
scikit-learn
We have the shuffle() method in sklearn.utils that we can use to shuffle the DataFrame.
from sklearn.datasets import load_iris
from sklearn.utils import shuffle
import pandas as pd
import time
# Load a dataset
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
tik = time.time()
for _ in range(10000):
    df = shuffle(df).reset_index(drop=True)
tok = time.time()
# Total seconds for 10,000 runs divided by 10 gives milliseconds per run
print("Time taken: " + str((tok - tik) / 10) + "ms")
NumPy
Finally, we have the permutation() method in numpy.random that we can use to shuffle a list of indices of the DataFrame. We can then select the DataFrame rows using this permuted list, which will shuffle the DataFrame.
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
import time
# Load a dataset
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
tik = time.time()
for _ in range(10000):
    df = df.iloc[np.random.permutation(len(df))].reset_index(drop=True)
tok = time.time()
# Total seconds for 10,000 runs divided by 10 gives milliseconds per run
print("Time taken: " + str((tok - tik) / 10) + "ms")
Conclusion
As we can see, the scikit-learn shuffle() function takes the longest time, while the shuffle using the pandas built-in sample() method takes the shortest. The exact timings depend on the machine and the library versions.