Pandas is a Python library for storing and organizing data in rows and columns. With it, we can clean up messy data and plot it. Pandas uses two data structures to store data: DataFrame and Series. A DataFrame stores two-dimensional data, whereas a Series stores one-dimensional data as a labeled array.
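To make the difference between the two structures concrete, here is a minimal sketch that can be run once Pandas is installed. The variable names ages and people are only illustrative.

```python
import pandas as pd

# A Series holds one-dimensional, labeled data
ages = pd.Series([25, 30, 22], name="Age")

# A DataFrame holds two-dimensional data; each of its columns is a Series
people = pd.DataFrame({"Name": ["John", "Alice", "Bob"], "Age": [25, 30, 22]})

print(ages.ndim)            # number of dimensions of the Series
print(people.ndim)          # number of dimensions of the DataFrame
print(type(people["Age"]))  # selecting one column gives back a Series
```

Selecting a single column from a DataFrame returns a Series, which is why the two structures are usually used together.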
To install Pandas on Linux, you first need pip. Download the get-pip.py script from the URL below into ~/Downloads, then run the commands that follow it.
https://bootstrap.pypa.io/get-pip.py
cd ~/Downloads
python get-pip.py
pip --version
If pip is already installed, you can skip these commands. The next step is to install the Pandas library with the following command.
pip install pandas
After a successful installation, pip prints a message confirming that pandas and its dependencies were installed.
Pandas is now installed on your system. Let us discuss some functions provided by the Pandas library for data manipulation.
Pandas provides functions for data manipulation and analysis in Python, including creating Series and DataFrames, reading and writing data in various formats, exploring data, filtering, grouping, and sorting. The examples below cover the most notable ones.
The code below creates a DataFrame named df using a dictionary of lists containing Name, Age, and Salary data.
import pandas as pd

# Create a DataFrame using a dictionary of lists
data = {'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
        'Age': [25, 30, 22, 28, 35],
        'Salary': [50000, 60000, 45000, 70000, 80000]}
df = pd.DataFrame(data)

# Display the DataFrame
print("DataFrame:")
print(df)
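Reading and writing data was mentioned earlier as one of the things Pandas can do. As a minimal sketch, assuming CSV as the format and an in-memory text buffer instead of a real file (both are choices made here only for illustration), a DataFrame can be written with to_csv() and read back with read_csv():

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Alice"], "Age": [25, 30]})

# Write the DataFrame as CSV text into an in-memory buffer.
# index=False leaves the row labels out of the output.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a new DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

In practice you would usually pass a file path instead of the buffer, for example a hypothetical people.csv.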
We can access a column using df['Name'] and a row by its index label using df.loc[2]. Run the code below to see both.
import pandas as pd

# Create a DataFrame using a dictionary of lists
data = {'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
        'Age': [25, 30, 22, 28, 35],
        'Salary': [50000, 60000, 45000, 70000, 80000]}
df = pd.DataFrame(data)

# Accessing columns and rows
print("\nAccessing Columns:")
print(df['Name'])
print("\nAccessing Rows:")
print(df.loc[2])
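Note that loc selects by label, not by position. With the default index the two happen to match, so the difference is easy to miss. The sketch below uses a small DataFrame with custom row labels (10, 20, 30, chosen only for illustration) to contrast loc with the position-based iloc:

```python
import pandas as pd

people = pd.DataFrame(
    {"Name": ["John", "Alice", "Bob"], "Age": [25, 30, 22]},
    index=[10, 20, 30],  # custom row labels, chosen only for illustration
)

by_label = people.loc[20]     # the row whose label is 20
by_position = people.iloc[2]  # the third row, whatever its label is

print(by_label["Name"])
print(by_position["Name"])
```

Here people.loc[2] would raise a KeyError, because no row carries the label 2.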
We can print a summary of the numerical columns in the DataFrame df using the describe() method. It displays the count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum for each numerical column.
import pandas as pd

# Create a DataFrame using a dictionary of lists
data = {'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
        'Age': [25, 30, 22, 28, 35],
        'Salary': [50000, 60000, 45000, 70000, 80000]}
df = pd.DataFrame(data)

# Data Summary
print("\nData Summary:")
print(df.describe())
We can create a new column. In this case, the newly added column Gender is assigned the values ['Male', 'Female', 'Male', 'Female', 'Male'], one for each existing row in the DataFrame.
import pandas as pd

# Create a DataFrame using a dictionary of lists
data = {'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
        'Age': [25, 30, 22, 28, 35],
        'Salary': [50000, 60000, 45000, 70000, 80000]}
df = pd.DataFrame(data)

# Adding New Columns
df['Gender'] = ['Male', 'Female', 'Male', 'Female', 'Male']
print("\nDataFrame with New Column:")
print(df)
The code below groups the DataFrame df by the Gender column and aggregates each group. It calculates the mean Salary and the maximum Age for each gender, storing the result in the grouped_data DataFrame.
import pandas as pd

# Create a DataFrame using a dictionary of lists
data = {'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
        'Age': [25, 30, 22, 28, 35],
        'Salary': [50000, 60000, 45000, 70000, 80000]}
df = pd.DataFrame(data)
df['Gender'] = ['Male', 'Female', 'Male', 'Female', 'Male']

# Grouping and Aggregation
grouped_data = df.groupby('Gender').agg({'Salary': 'mean', 'Age': 'max'})
print("\nGrouped and Aggregated Data:")
print(grouped_data)
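With the dictionary form, the result columns keep their original names (Salary and Age), which hides the fact that they now hold a mean and a maximum. As an optional variation, and not the approach used above, named aggregation lets you choose the output column names. The names mean_salary and max_age below are illustrative.

```python
import pandas as pd

data = {'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
        'Age': [25, 30, 22, 28, 35],
        'Salary': [50000, 60000, 45000, 70000, 80000],
        'Gender': ['Male', 'Female', 'Male', 'Female', 'Male']}
df = pd.DataFrame(data)

# Named aggregation: output_name=(source_column, aggregation_function)
summary = df.groupby('Gender').agg(
    mean_salary=('Salary', 'mean'),
    max_age=('Age', 'max'),
)
print(summary)
```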
We can apply a filter to the DataFrame df based on the Age column, keeping only the rows where Age is greater than 25.
import pandas as pd

# Create a DataFrame using a dictionary of lists
data = {'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
        'Age': [25, 30, 22, 28, 35],
        'Salary': [50000, 60000, 45000, 70000, 80000]}
df = pd.DataFrame(data)

# Filtering Data
print("\nFiltered Data:")
filtered_data = df[df['Age'] > 25]
print(filtered_data)
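Filters can also combine several conditions. The sketch below keeps rows where Age is greater than 25 and Salary is below 75000; the salary threshold is an arbitrary value picked for illustration. Each condition needs its own parentheses, and the conditions are joined with & (and) or | (or) rather than the Python keywords and / or.

```python
import pandas as pd

data = {'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
        'Age': [25, 30, 22, 28, 35],
        'Salary': [50000, 60000, 45000, 70000, 80000]}
df = pd.DataFrame(data)

# Build a boolean mask from two conditions, then use it to select rows
mask = (df['Age'] > 25) & (df['Salary'] < 75000)
selected = df[mask]
print(selected)
```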
The code below sorts the DataFrame df by the Salary column in descending order and stores the result in the sorted_data DataFrame.
import pandas as pd

# Create a DataFrame using a dictionary of lists
data = {'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
        'Age': [25, 30, 22, 28, 35],
        'Salary': [50000, 60000, 45000, 70000, 80000]}
df = pd.DataFrame(data)

# Sorting Data
sorted_data = df.sort_values(by='Salary', ascending=False)
print("\nSorted Data:")
print(sorted_data)
With the help of Pandas, we can create a bar plot from the DataFrame df with the Name column on the x-axis and the Salary column on the y-axis. Plotting relies on Matplotlib, so it must be installed as well. Run the code to see the plot.
import os
import pandas as pd
import matplotlib.pyplot as plt

# Create a DataFrame using a dictionary of lists
data = {'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
        'Age': [25, 30, 22, 28, 35],
        'Salary': [50000, 60000, 45000, 70000, 80000]}
df = pd.DataFrame(data)

# Data Visualization
df.plot(x='Name', y='Salary', kind='bar', title='Salary Distribution')

# Make sure the output directory exists, otherwise savefig raises an error
os.makedirs("./output", exist_ok=True)
plt.savefig("./output/Plot.png")
plt.show()
Pandas makes working with data easy and fun, regardless of how experienced you are. With Pandas, you can explore and analyze information and get valuable insights from your datasets. So, whether you're working with financial data, healthcare records, or social media insights, Pandas is a dependable companion that helps you handle data challenges confidently and efficiently.