How to shuffle DataFrame rows

There are several ways to shuffle the rows of a pandas DataFrame in Python. The pandas, scikit-learn, and NumPy libraries each provide a function or method that we can use to shuffle the rows in our dataset.

pandas

The DataFrame.sample() method returns a random sample of rows, so sampling the whole dataset returns every row in random order. The frac argument sets the fraction of rows to return; frac=1 returns all of them. The method also accepts the random_state parameter, which fixes the seed so that the results are reproducible.
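As a minimal sketch of the behavior described above, the snippet below shuffles a small DataFrame with sample(). The toy data and the seed value 42 are assumptions made for illustration; they are not part of the original example.

```python
import pandas as pd

# Small illustrative DataFrame (hypothetical data, not from the article).
df = pd.DataFrame({"id": range(10), "value": list("abcdefghij")})

# frac=1 draws every row exactly once (sampling is without replacement
# by default), which amounts to a shuffle.
shuffled = df.sample(frac=1, random_state=42)

# Reusing the same seed reproduces the same row order.
shuffled_again = df.sample(frac=1, random_state=42)

# A smaller fraction returns a random subset instead of a full shuffle.
half = df.sample(frac=0.5, random_state=42)

print(shuffled)
```

Leaving out random_state gives a different order on each call, which is usually what we want outside of tests and tutorials.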

scikit-learn

The shuffle() function in the sklearn.utils module returns a copy of the input dataset with its rows randomly permuted. It also accepts the random_state parameter, which makes the results reproducible.
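The following sketch shows how shuffle() might be called on a DataFrame. The toy data, the labels Series, and the seed value 0 are assumptions for illustration only. It also shows that several inputs of the same length can be passed together so that they are shuffled in the same order.

```python
import pandas as pd
from sklearn.utils import shuffle

# Small illustrative DataFrame and label column (hypothetical data).
df = pd.DataFrame({"id": range(10), "value": list("abcdefghij")})
labels = pd.Series([i % 2 for i in range(10)], name="label")

# shuffle() returns a shuffled copy; the input DataFrame is left unchanged.
shuffled = shuffle(df, random_state=0)

# The same seed gives the same permutation.
shuffled_again = shuffle(df, random_state=0)

# Passing several inputs shuffles them consistently, keeping rows and labels aligned.
df_shuffled, labels_shuffled = shuffle(df, labels, random_state=0)

print(shuffled)
```

Keeping features and labels aligned in this way is a common reason to reach for the scikit-learn function rather than shuffling each object separately.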

NumPy

We can use np.random.permutation() to produce an array that holds the integers from 0 to len(df) - 1 in random order. Passing this array to the DataFrame.iloc indexer, which selects rows by integer position, returns the rows in that shuffled order.

Note: A shuffled DataFrame keeps its original index labels. We can call DataFrame.reset_index(drop=True) so that the newly assigned indices are retained and the old ones are dropped.
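To make the two steps concrete, here is a small sketch that builds the permutation, applies it with iloc, and then resets the index. The toy data and the call to np.random.seed(0) are assumptions added for reproducibility; they do not appear in the original example.

```python
import numpy as np
import pandas as pd

# Small illustrative DataFrame (hypothetical data, not from the article).
df = pd.DataFrame({"id": range(10), "value": list("abcdefghij")})

# Seed NumPy's global generator so the sketch is repeatable (an assumption).
np.random.seed(0)

# An array containing 0..len(df)-1 in random order.
positions = np.random.permutation(len(df))

# iloc selects rows by integer position, so this reorders the rows.
shuffled = df.iloc[positions]

# The old index labels travel with the rows; drop=True discards them
# and assigns a fresh 0..n-1 index.
shuffled_reset = shuffled.reset_index(drop=True)

print(shuffled.index.tolist())
print(shuffled_reset.index.tolist())
```

Without drop=True, reset_index() would keep the old labels as a new column named index rather than discarding them.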

Example

from sklearn.datasets import load_iris
from sklearn.utils import shuffle
import pandas as pd
import numpy as np
# Load a dataset
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
print("Original Dataset:")
print(df.head())
# Using pandas
df1 = df.sample(frac=1).reset_index(drop=True)
print("Dataset shuffled using pandas:")
print(df1.head())
# Using scikit-learn
df2 = shuffle(df)
print("Dataset shuffled using Sklearn:")
print(df2.head())
# Using NumPy
df3 = df.iloc[np.random.permutation(len(df))].reset_index(drop=True)
print("Dataset shuffled using NumPy:")
print(df3.head())

Code

Lines 1–4: We import the relevant libraries.
Lines 6–7: We load the Iris dataset from the sklearn library into a DataFrame.
Line 11: We use the pandas DataFrame.sample() method with frac=1 to shuffle the dataset.
Line 15: We use the sklearn.utils.shuffle() function to shuffle the dataset.
Line 19: We use the NumPy library to generate an array of randomly ordered row positions, which we then pass to the DataFrame.iloc indexer to select the rows.

Execution

Let's examine the speed of these three methods. For each one, we will use the time library to measure the running time of 10,000 runs of the shuffle, and we will report the result in ms/run. Execution times are important in practice when we want to write code that can scale to large datasets.
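One way such a measurement could be set up is sketched below. The helper name ms_per_run, the random stand-in data, and the reduced run count are assumptions for illustration; this is not the article's own benchmark code, and the numbers it prints depend on the machine.

```python
import time

import numpy as np
import pandas as pd
from sklearn.utils import shuffle


def ms_per_run(func, n_runs=10_000):
    """Return the average wall-clock time of func() in milliseconds per run.

    Hypothetical helper, written for this sketch only.
    """
    start = time.perf_counter()
    for _ in range(n_runs):
        func()
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / n_runs


# Stand-in data with the shape of the Iris features (150 rows x 4 columns).
df = pd.DataFrame(np.random.rand(150, 4), columns=list("abcd"))

# n_runs is reduced here so the sketch finishes quickly; the text uses 10,000.
timings = {
    "pandas": ms_per_run(lambda: df.sample(frac=1), n_runs=1_000),
    "scikit-learn": ms_per_run(lambda: shuffle(df), n_runs=1_000),
    "NumPy": ms_per_run(
        lambda: df.iloc[np.random.permutation(len(df))], n_runs=1_000
    ),
}

for name, ms in timings.items():
    print(f"{name}: {ms:.4f} ms/run")
```

time.perf_counter() is used because it is a monotonic, high-resolution clock intended for measuring short durations.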

pandas

In pandas, we have the built-in sample() method that we can use to shuffle the data.
