How to slice columns in pandas
pandas is a powerful Python open-source library for performing Exploratory Data Analysis (EDA) tasks and manipulating large datasets. It provides efficient tools for slicing, filtering, aggregating, and pivoting data.
Overall, pandas is an essential tool for anyone working with data, whether they are a data scientist, analyst, or researcher.
Data slicing
Data slicing means selecting a subset of a dataset, such as particular rows or columns, so that the analysis can focus on the relevant part. Working with smaller subsets makes large, complex datasets easier to inspect, helps patterns and trends stand out, and leaves out noise and irrelevant data.
Setting up a DataFrame for slicing
To begin slicing data in pandas, the first step is to import the pandas library. Once imported, we will create a sample DataFrame df with four rows and four columns.
Code example
The code below imports the pandas library and creates a sample DataFrame to slice:
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [2, 3, 4, 5],
                   "c": [3, 4, 5, 6],
                   "d": [4, 5, 6, 7]})
print(df)
Code explanation
Line 1: We import the pandas library.
Lines 2–5: We create a sample DataFrame df by calling the DataFrame() constructor from pandas.
Line 6: We print the sample DataFrame to the console using the print() function.
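Before slicing, it can help to confirm the DataFrame's shape and column labels. The following is a minimal sketch that rebuilds the same sample DataFrame; the names n_rows, n_cols, and column_labels are chosen here only for illustration.

```python
import pandas as pd

# Rebuild the sample DataFrame used throughout this article.
df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [2, 3, 4, 5],
                   "c": [3, 4, 5, 6],
                   "d": [4, 5, 6, 7]})

# shape is a (rows, columns) tuple; columns holds the column labels.
n_rows, n_cols = df.shape
column_labels = list(df.columns)

print(n_rows, n_cols)   # 4 4
print(column_labels)    # ['a', 'b', 'c', 'd']
```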
Methods for slicing columns in pandas
After creating the sample DataFrame, we can slice its columns in several ways: the reindex() method, the [] notation, and the .loc[] and .iloc[] indexers. Each has its own benefits and limitations, depending on the requirements of the data analysis task. We'll explore each technique in detail and demonstrate how it can be used to slice columns in the DataFrame.
Slicing a column using reindex
The reindex() method returns a new DataFrame whose rows or columns conform to the labels we pass in. It is useful when we want to select certain columns by label, in an order we choose. Any requested label that doesn't exist in the original DataFrame appears as a column of NaN values instead of raising an error.
Code example
The code below selects the column b from the original DataFrame df and stores it in the new DataFrame df_slice:
df_slice = df.reindex(columns=['b'])
print(df_slice)
Code explanation
Line 1: We create a new variable df_slice to store a subset of the original DataFrame df by using the reindex method. The columns parameter of the reindex method is set to ['b'], which means that the new DataFrame df_slice will contain only the b column from the original DataFrame.
Line 2: We print the new DataFrame df_slice to the console using the print() function.
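One behavior of reindex worth checking is what happens when a requested label is absent. The sketch below assumes the same sample DataFrame and uses a hypothetical label "z" that is not one of its columns; it contrasts reindex with the [] notation, which raises a KeyError in current pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [2, 3, 4, 5],
                   "c": [3, 4, 5, 6],
                   "d": [4, 5, 6, 7]})

# "z" is a hypothetical label that does not exist in df.
# reindex keeps "b" and creates "z" filled with NaN.
df_missing = df.reindex(columns=["b", "z"])
print(df_missing)

# The [] notation is stricter: a missing label raises KeyError.
try:
    df[["b", "z"]]
    raised_key_error = False
except KeyError:
    raised_key_error = True
print(raised_key_error)
```

If silently created NaN columns would hide a typo in a column name, the stricter [] notation may be the safer choice.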
Slicing multiple columns using reindex
Passing several labels to reindex extracts multiple columns at once. The columns in the result appear in the order we list them, which need not match their order in the original DataFrame.
Code example
The code below selects the columns c and a, in that order, from the original DataFrame df and stores them in the new DataFrame df_slice:
df_slice = df.reindex(columns=['c','a'])
print(df_slice)
Code explanation
Line 1: We create a new variable df_slice to store a subset of the original DataFrame df by using the reindex method. The columns parameter of the reindex method is set to ['c','a'], which means that the new DataFrame df_slice will contain two columns, c and a, from the original DataFrame.
Line 2: We print the new DataFrame df_slice to the console using the print() function.
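To see how reindex treats column order, the following sketch (again assuming the sample DataFrame from this article) compares the column labels of the result with those of the original. The names sliced_order and original_order are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [2, 3, 4, 5],
                   "c": [3, 4, 5, 6],
                   "d": [4, 5, 6, 7]})

df_slice = df.reindex(columns=["c", "a"])

# The result follows the order of the list passed to reindex.
sliced_order = list(df_slice.columns)
# reindex returns a new object; the original DataFrame is unchanged.
original_order = list(df.columns)

print(sliced_order)    # ['c', 'a']
print(original_order)  # ['a', 'b', 'c', 'd']
```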
Slicing a column using the [ ] notation
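With this simple method, we pass column names inside square brackets. Single brackets with one column name, such as df['c'], return a one-dimensional Series, while double brackets with a list of names, such as df[['c','d']], return a two-dimensional DataFrame.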
Code example
The code below selects the columns c and d from the original DataFrame df using double-bracket indexing and stores them in the new DataFrame df_slice:
df_slice = df[['c','d']]
print(df_slice)
Code explanation
Line 1: We create a new variable df_slice to store a subset of the original DataFrame df by using the [] notation. We pass the list ['c','d'] inside the brackets, which means that the new DataFrame df_slice will contain two columns, c and d, from the original DataFrame.
Line 2: We print the new DataFrame df_slice to the console using the print() function.
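The difference between single and double brackets is easiest to see by checking the type and shape of each result. This is a small sketch on the same sample DataFrame; the variable names single and double are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [2, 3, 4, 5],
                   "c": [3, 4, 5, 6],
                   "d": [4, 5, 6, 7]})

single = df["c"]    # one label in single brackets: a 1-D Series
double = df[["c"]]  # a list of labels in double brackets: a 2-D DataFrame

print(type(single).__name__, single.shape)  # Series (4,)
print(type(double).__name__, double.shape)  # DataFrame (4, 1)
```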
Slicing columns using the .loc[ ] indexer with step size 2
The pandas library includes an indexer called .loc[ ] that enables label-based slicing of a DataFrame. With it, we can access a specific group of rows and columns from a DataFrame using their labels. Unlike ordinary Python slices, a label slice includes both its start and end labels.
Code example
The code below creates a new DataFrame df_slice by selecting every second column between a and d from the original DataFrame df, using the loc indexing syntax with a step size of 2. Because the columns are a, b, c, and d, the result contains the columns a and c:
df_slice = df.loc[:, 'a':'d':2]
print(df_slice)
Code explanation
Line 1: We create a new variable named df_slice to store a subset of a pandas DataFrame. The : on the left side of the comma specifies that we want to select all the rows of the DataFrame, and 'a':'d':2 on the right side of the comma specifies that we want to select columns with labels between, and including, a and d, but only every second column.
Line 2: We print the subset of the original DataFrame, which contains only the columns a and c.
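The sketch below, which assumes the same sample DataFrame, checks two properties of label slices with .loc: both endpoints are included, and a step counts columns by position starting from the first label. The names inclusive, stepped, and positional are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [2, 3, 4, 5],
                   "c": [3, 4, 5, 6],
                   "d": [4, 5, 6, 7]})

# Both endpoints of a label slice are included.
inclusive = df.loc[:, "b":"d"]
print(list(inclusive.columns))  # ['b', 'c', 'd']

# With a step of 2, every second column starting at "a" is kept.
stepped = df.loc[:, "a":"d":2]
print(list(stepped.columns))    # ['a', 'c']

# The same selection written with integer positions.
positional = df.iloc[:, 0:4:2]
print(stepped.equals(positional))  # True
```

Column d is not part of the stepped result because, starting from a and moving two columns at a time, the selection lands on c and then moves past the end of the slice.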
Slicing columns using the .iloc[ ] indexer with step size 1
pandas also includes an indexer called .iloc[ ] that allows integer-position-based slicing of a DataFrame. It is particularly helpful when the labels are not numeric or when we don't know the labels in advance.
Code example
The code below creates a new DataFrame df_slice by selecting columns 0, 1, and 2 from the original DataFrame df, using the .iloc indexing syntax with a step size of 1.
df_slice = df.iloc[:, 0:3:1]
print(df_slice)
Code explanation
Line 1: We create a new variable named df_slice to store a subset of a pandas DataFrame. The : on the left side of the comma specifies that we want to select all the rows of the DataFrame, and 0:3:1 on the right side of the comma specifies that we want to select columns with integer positions between 0 (inclusive) and 3 (exclusive), in steps of 1.
Line 2: We print the subset of the original DataFrame, which contains only the first three columns.
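Because .iloc works with positions, it also accepts lists of positions and negative indices, following the usual Python conventions. The following is a short sketch on the same sample DataFrame; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [2, 3, 4, 5],
                   "c": [3, 4, 5, 6],
                   "d": [4, 5, 6, 7]})

# A list of positions picks non-adjacent columns; -1 is the last column.
first_and_last = df.iloc[:, [0, -1]]
print(list(first_and_last.columns))  # ['a', 'd']

# A negative start selects the last two columns.
last_two = df.iloc[:, -2:]
print(list(last_two.columns))        # ['c', 'd']

# Omitting start and stop applies the step to all columns.
every_other = df.iloc[:, ::2]
print(list(every_other.columns))     # ['a', 'c']
```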