We have several options for shuffling the rows of a pandas DataFrame in Python. In short, the pandas, scikit-learn, and NumPy libraries each provide a method that we can use to shuffle the rows of our dataset.
The DataFrame.sample() method returns a random sample of rows from the dataset. The method accepts the argument frac, which determines the fraction of rows to return, so frac=1 returns every row in random order, which shuffles the dataset. It also accepts the parameter random_state, which lets us produce reproducible results.
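As a minimal sketch of how frac and random_state behave, the example below uses a small made-up DataFrame (the column names and values are illustrative assumptions, not part of the Iris example used in this article):

```python
import pandas as pd

# Small illustrative DataFrame (hypothetical data).
df = pd.DataFrame({"id": [0, 1, 2, 3, 4], "value": ["a", "b", "c", "d", "e"]})

# frac=1 keeps every row; random_state fixes the random order.
shuffled_a = df.sample(frac=1, random_state=42)
shuffled_b = df.sample(frac=1, random_state=42)

# The same seed gives the same row order on both calls.
print(shuffled_a.equals(shuffled_b))  # True

# A smaller frac returns only that share of the rows (3 of 5 here).
subset = df.sample(frac=0.6, random_state=42)
print(len(subset))  # 3
```

Leaving random_state unset gives a different order on each call, which is usually what we want outside of tests and tutorials.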
The shuffle() function in the sklearn.utils module returns a randomly permuted copy of the rows of the dataset we provide as input. It also accepts the random_state parameter, which lets us produce reproducible results.
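The following sketch shows random_state with sklearn.utils.shuffle() on a small made-up DataFrame (the data is an illustrative assumption). It also checks that the input DataFrame is left unchanged:

```python
import pandas as pd
from sklearn.utils import shuffle

# Small illustrative DataFrame (hypothetical data).
df = pd.DataFrame({"id": [0, 1, 2, 3, 4], "value": ["a", "b", "c", "d", "e"]})

# Two calls with the same seed return the same row order.
shuffled_a = shuffle(df, random_state=0)
shuffled_b = shuffle(df, random_state=0)
print(shuffled_a.equals(shuffled_b))  # True

# shuffle() returns a shuffled copy; the original order is untouched.
print(df["id"].tolist())  # [0, 1, 2, 3, 4]
```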
We can use np.random.permutation() to produce an array containing the integers from 0 up to the length of the DataFrame minus one in random order. We can then pass this array to the DataFrame.iloc indexer, which selects rows by integer position, to shuffle the dataset.
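The sketch below walks through the two steps separately on a small made-up DataFrame. Note one assumption beyond the article: it draws the permutation from a seeded np.random.default_rng() generator rather than calling np.random.permutation() directly, so that the result is reproducible.

```python
import numpy as np
import pandas as pd

# Small illustrative DataFrame (hypothetical data); "id" equals the row position.
df = pd.DataFrame({"id": [0, 1, 2, 3, 4], "value": ["a", "b", "c", "d", "e"]})

# Assumption: a seeded Generator is used here only for reproducibility.
rng = np.random.default_rng(seed=0)

# Step 1: every position 0..len(df)-1 appears exactly once, in random order.
order = rng.permutation(len(df))

# Step 2: iloc picks the rows in that order.
shuffled = df.iloc[order]

# Because "id" mirrors the original positions, it now matches the permutation.
print(shuffled["id"].tolist() == order.tolist())  # True
```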
Note: After shuffling, each row keeps its original index label. We can call DataFrame.reset_index(drop=True) so that the rows receive a new index starting at 0 and the old index is dropped.
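To make the effect of the index reset concrete, here is a small sketch on made-up data (the DataFrame is an illustrative assumption) comparing reset_index() with and without drop=True:

```python
import pandas as pd

# Small illustrative DataFrame (hypothetical data).
df = pd.DataFrame({"value": ["a", "b", "c", "d"]})

# The original labels travel with the rows when we shuffle.
shuffled = df.sample(frac=1, random_state=1)

# Without drop=True, the old index is kept as a new column named "index".
kept = shuffled.reset_index()
print(kept.columns.tolist())  # ['index', 'value']

# With drop=True, the old index is discarded and rows are relabeled 0..n-1.
dropped = shuffled.reset_index(drop=True)
print(dropped.index.tolist())  # [0, 1, 2, 3]
```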
from sklearn.datasets import load_iris
from sklearn.utils import shuffle
import pandas as pd
import numpy as np

# Load a dataset
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
print("Original Dataset:")
print(df.head())
# Using pandas
df1 = df.sample(frac=1).reset_index(drop=True)
print("Dataset shuffled using pandas:")
print(df1.head())
# Using scikit-learn
df2 = shuffle(df)
print("Dataset shuffled using Sklearn:")
print(df2.head())
# Using NumPy
df3 = df.iloc[np.random.permutation(len(df))].reset_index(drop=True)
print("Dataset shuffled using NumPy:")
print(df3.head())
Lines 1–4: We import the relevant libraries.
Lines 7–8: We load the Iris dataset from the sklearn library.
Line 11: We use pandas to shuffle the dataset.
Line 15: We use the sklearn.utils.shuffle() function to shuffle the dataset.
Line 19: We use the NumPy library to generate an array of randomly ordered row positions, which we then pass to the DataFrame.iloc indexer to select the rows.
Let's examine the speed of these three methods. In all three examples, we use the time module to measure the running time of 10,000 runs of each shuffle method, and we report the result in ms/run. Execution time matters in practice when we want to write code that scales to large datasets.
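The measurement itself can be sketched as a small helper. The name ms_per_run and the use of time.perf_counter() are assumptions made for illustration; the benchmarks in this article use time.time() and divide the elapsed seconds by 10, which is the same conversion for 10,000 runs (seconds × 1000 ÷ 10,000).

```python
import time


def ms_per_run(func, runs=10_000):
    """Return the average wall-clock time of func() in milliseconds per run."""
    start = time.perf_counter()
    for _ in range(runs):
        func()
    elapsed_seconds = time.perf_counter() - start
    # Convert seconds to milliseconds, then average over the number of runs.
    return elapsed_seconds * 1000 / runs


# Usage with a trivial stand-in workload; any zero-argument callable works.
average = ms_per_run(lambda: sorted([3, 1, 2]), runs=1_000)
print(f"Time taken: {average:.6f} ms/run")
```

A shuffle can be timed the same way by wrapping it in a lambda, for example lambda: df.sample(frac=1).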
In pandas, we have the built-in sample() method that we can use to shuffle the data.
import time
from sklearn.datasets import load_iris
import pandas as pd

# Load a dataset
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
tik = time.time()
for i in range(10000):
    df = df.sample(frac=1).reset_index(drop=True)
tok = time.time()
# Elapsed seconds * 1000 ms / 10,000 runs = elapsed seconds / 10
print("Time taken: " + str((tok - tik) / 10) + " ms/run")
We have the shuffle() function in sklearn.utils that we can use to shuffle the DataFrame.
from sklearn.datasets import load_iris
from sklearn.utils import shuffle
import pandas as pd
import time

# Load a dataset
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
tik = time.time()
for i in range(10000):
    df = shuffle(df).reset_index(drop=True)
tok = time.time()
# Elapsed seconds * 1000 ms / 10,000 runs = elapsed seconds / 10
print("Time taken: " + str((tok - tik) / 10) + " ms/run")
Finally, we have the permutation() function in numpy.random that we can use to shuffle an array of the DataFrame's row positions. We can then select the DataFrame rows in this permuted order, which shuffles the DataFrame.
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
import time

# Load a dataset
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
tik = time.time()
for i in range(10000):
    df = df.iloc[np.random.permutation(len(df))].reset_index(drop=True)
tok = time.time()
# Elapsed seconds * 1000 ms / 10,000 runs = elapsed seconds / 10
print("Time taken: " + str((tok - tik) / 10) + " ms/run")
As we can see, the scikit-learn shuffle() function takes the longest time, while the pandas built-in sample() method takes the shortest.