How to install Pandas library in Python on Linux

Pandas is a library that helps to store and organize data in rows and columns. We can clean up messy data and plot it with the help of this library. It uses two data structures to store the data: DataFrame and Series. A DataFrame stores two-dimensional data, whereas a Series stores one-dimensional data in the form of a labeled array.
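
To make the difference between the two structures concrete, here is a minimal sketch. The variable names ages and people are only illustrative and are not used elsewhere in this article.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
ages = pd.Series([25, 30, 22], name="Age")

# A DataFrame is a two-dimensional table; each of its columns is a Series.
people = pd.DataFrame({"Name": ["John", "Alice", "Bob"], "Age": [25, 30, 22]})

print(ages.ndim)             # number of dimensions of the Series
print(people.ndim)           # number of dimensions of the DataFrame
print(type(people["Age"]))   # selecting one column gives back a Series
```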

Installation of Pandas

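Pandas is installed with pip, so pip must be available on your Linux system first. If it is missing, download the get-pip.py script and run it with the commands below. Use python3 in place of python if your distribution does not provide the python command.

cd ~/Downloads
curl -O https://bootstrap.pypa.io/get-pip.py
python get-pip.py
pip --version
Commands for pip installation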

If pip is already installed, you can skip the commands above. The next step is to install the Pandas library with the following command.

pip install pandas

After Pandas installs successfully, the terminal output will look like this.

Pip installed successfully
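
To confirm that the library can be imported, we can print its version from Python. This is only a quick check, and the version number printed depends on your installation.

```python
import pandas as pd

# If the import succeeds, Pandas is installed for this Python interpreter.
print(pd.__version__)
```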

Pandas is now installed on your system. Let us discuss some functions the library provides for data manipulation.

Functions of Pandas

Pandas provides data manipulation and analysis functions in Python, including creating Series and DataFrames, reading and writing data in various formats, and exploring, filtering, grouping, and sorting data. Let us look at some notable functions of Pandas.
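
Reading and writing is not covered by the sections that follow, so here is a small sketch of a CSV round trip. It keeps the CSV text in memory with io.StringIO instead of a file on disk, which is an assumption made to keep the example self-contained. The names csv_text and restored are illustrative.

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Alice"], "Age": [25, 30]})

# Write the DataFrame to CSV text; index=False leaves out the row labels.
csv_text = df.to_csv(index=False)

# Read the CSV text back into a new DataFrame.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

Passing a file path instead of a StringIO object to to_csv() and read_csv() works the same way for files on disk.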

Create DataFrame

The code below creates a DataFrame named df using a dictionary of lists containing Name, Age, and Salary data.

import pandas as pd
# Create a DataFrame using a dictionary of lists
data = {
    'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
    'Age': [25, 30, 22, 28, 35],
    'Salary': [50000, 60000, 45000, 70000, 80000]
}
df = pd.DataFrame(data)
# Display the DataFrame
print("DataFrame:")
print(df)
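
A dictionary of lists is only one possible input. As a sketch of an alternative, a DataFrame can also be built from a list of dictionaries, where each dictionary describes one row. The names records and df_from_records are illustrative.

```python
import pandas as pd

# Each dictionary is one row; the keys become the column names.
records = [
    {"Name": "John", "Age": 25, "Salary": 50000},
    {"Name": "Alice", "Age": 30, "Salary": 60000},
]
df_from_records = pd.DataFrame(records)

print(df_from_records.shape)          # (number of rows, number of columns)
print(list(df_from_records.columns))  # column labels
print(df_from_records.dtypes)         # data type inferred for each column
```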

Access columns and rows

We can access a column using df['Name'] and a row using df.loc[2], which selects the row whose index label is 2. Each of these returns a Series.

import pandas as pd
data = {
    'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
    'Age': [25, 30, 22, 28, 35],
    'Salary': [50000, 60000, 45000, 70000, 80000]
}
df = pd.DataFrame(data)
# Accessing columns and rows
print("\nAccessing Columns:")
print(df['Name'])
print("\nAccessing Rows:")
print(df.loc[2])
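
Beyond a single column or row, we can also select several columns, a single cell, or rows by position. The sketch below uses the same data; the names name_and_age, bob_salary, and first_two are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Alice", "Bob", "Emily", "Michael"],
    "Age": [25, 30, 22, 28, 35],
    "Salary": [50000, 60000, 45000, 70000, 80000],
})

# A list of column names returns a smaller DataFrame.
name_and_age = df[["Name", "Age"]]

# loc selects by label: a row label followed by a column label gives one cell.
bob_salary = df.loc[2, "Salary"]

# iloc selects by integer position; this slice takes the first two rows.
first_two = df.iloc[0:2]

print(name_and_age)
print(bob_salary)
print(first_two)
```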

Data summary

We can print a summary of the numerical columns in the DataFrame df using the describe() function. It displays statistics such as count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum values for each numerical column.

import pandas as pd
# Create a DataFrame using a dictionary of lists
data = {
    'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
    'Age': [25, 30, 22, 28, 35],
    'Salary': [50000, 60000, 45000, 70000, 80000]
}
df = pd.DataFrame(data)
# Data Summary
print("\nData Summary:")
print(df.describe())
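
When only one or two statistics are needed, we can compute them directly on a column or pick them out of the describe() result. This is a small sketch on the same data; the names mean_age, median_salary, and summary are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Alice", "Bob", "Emily", "Michael"],
    "Age": [25, 30, 22, 28, 35],
    "Salary": [50000, 60000, 45000, 70000, 80000],
})

# Individual statistics are available as methods on a column.
mean_age = df["Age"].mean()
median_salary = df["Salary"].median()

# describe() returns a DataFrame, so loc can pick out a single statistic.
summary = df.describe()
print(mean_age)
print(median_salary)
print(summary.loc["mean", "Salary"])
```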

Adding columns

We can add a new column by assigning a list of values to a new column label. In this case, the new Gender column is assigned the values ['Male', 'Female', 'Male', 'Female', 'Male'], one for each existing row in the DataFrame.

import pandas as pd
# Create a DataFrame using a dictionary of lists
data = {
    'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
    'Age': [25, 30, 22, 28, 35],
    'Salary': [50000, 60000, 45000, 70000, 80000]
}
df = pd.DataFrame(data)
# Adding New Columns
df['Gender'] = ['Male', 'Female', 'Male', 'Female', 'Male']
print("\nDataFrame with New Column:")
print(df)
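
A new column does not have to come from a hand-written list. As a sketch, it can also be computed from existing columns. The column names MonthlySalary and Senior, and the age threshold of 30, are assumptions made only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Alice", "Bob", "Emily", "Michael"],
    "Age": [25, 30, 22, 28, 35],
    "Salary": [50000, 60000, 45000, 70000, 80000],
})

# Arithmetic on a column is applied to every row at once.
df["MonthlySalary"] = df["Salary"] / 12

# assign() returns a new DataFrame and leaves df itself unchanged.
with_flag = df.assign(Senior=df["Age"] >= 30)

print(df)
print(with_flag)
```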

Grouping and aggregation

The code below groups the DataFrame df by the Gender column and aggregates each group. It calculates the mean Salary and the maximum Age for each gender and stores the result in the grouped_data DataFrame.

import pandas as pd
# Create a DataFrame using a dictionary of lists
data = {
    'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
    'Age': [25, 30, 22, 28, 35],
    'Salary': [50000, 60000, 45000, 70000, 80000]
}
df = pd.DataFrame(data)
df['Gender'] = ['Male', 'Female', 'Male', 'Female', 'Male']
# Grouping and Aggregation
grouped_data = df.groupby('Gender').agg({'Salary': 'mean', 'Age': 'max'})
print("\nGrouped and Aggregated Data:")
print(grouped_data)
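
The same grouping can be written with named aggregation, which lets us choose the names of the result columns. This is a sketch of that alternative style; the result column names mean_salary, max_age, and headcount are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Alice", "Bob", "Emily", "Michael"],
    "Age": [25, 30, 22, 28, 35],
    "Salary": [50000, 60000, 45000, 70000, 80000],
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
})

# Each keyword is a result column: (source column, aggregation function).
summary = df.groupby("Gender").agg(
    mean_salary=("Salary", "mean"),
    max_age=("Age", "max"),
    headcount=("Name", "count"),
)
print(summary)
```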

Filtering data

We can filter the DataFrame df on the Age column, keeping only the rows where Age is greater than 25.

import pandas as pd
# Create a DataFrame using a dictionary of lists
data = {
    'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
    'Age': [25, 30, 22, 28, 35],
    'Salary': [50000, 60000, 45000, 70000, 80000]
}
df = pd.DataFrame(data)
# Filtering Data
print("\nFiltered Data:")
filtered_data = df[df['Age'] > 25]
print(filtered_data)
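
Filters can also combine several conditions. In the sketch below, the salary limit of 75000 is an assumption chosen only for illustration. Each condition needs its own parentheses, and & is used instead of the and keyword.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Alice", "Bob", "Emily", "Michael"],
    "Age": [25, 30, 22, 28, 35],
    "Salary": [50000, 60000, 45000, 70000, 80000],
})

# Combine two boolean conditions element by element with &.
mask = (df["Age"] > 25) & (df["Salary"] < 75000)
result = df[mask]

# query() expresses the same filter as a string.
same = df.query("Age > 25 and Salary < 75000")

print(result)
```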

Sorting data

The code below sorts the DataFrame df by the Salary column in descending order and stores the sorted data in the sorted_data DataFrame.

import pandas as pd
# Create a DataFrame using a dictionary of lists
data = {
    'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
    'Age': [25, 30, 22, 28, 35],
    'Salary': [50000, 60000, 45000, 70000, 80000]
}
df = pd.DataFrame(data)
# Sorting Data
sorted_data = df.sort_values(by='Salary', ascending=False)
print("\nSorted Data:")
print(sorted_data)
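
Sorting keeps the original index labels, so the sorted rows are no longer numbered from 0 upward. The sketch below renumbers them with reset_index() and also shows nlargest() as a shortcut for the top rows. The names youngest_first and top_two are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Alice", "Bob", "Emily", "Michael"],
    "Age": [25, 30, 22, 28, 35],
    "Salary": [50000, 60000, 45000, 70000, 80000],
})

# Sort by Age in ascending order, then renumber the rows from 0.
youngest_first = df.sort_values(by="Age").reset_index(drop=True)

# nlargest() returns the rows with the highest values in a column.
top_two = df.nlargest(2, "Salary")

print(youngest_first)
print(top_two)
```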

Data visualization

Pandas uses Matplotlib for plotting, so Matplotlib must be installed as well. We can create a bar plot from the DataFrame df with the Name column on the x-axis and the Salary column on the y-axis. The code saves the plot to ./output/Plot.png and then displays it.

import os
import pandas as pd
import matplotlib.pyplot as plt
# Create a DataFrame using a dictionary of lists
data = {
    'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
    'Age': [25, 30, 22, 28, 35],
    'Salary': [50000, 60000, 45000, 70000, 80000]
}
df = pd.DataFrame(data)
# Data Visualization
df.plot(x='Name', y='Salary', kind='bar', title='Salary Distribution')
# Create the output directory if it does not exist so that savefig does not fail
os.makedirs("./output", exist_ok=True)
plt.savefig("./output/Plot.png")
plt.show()

Conclusion

Pandas makes working with data easier regardless of how experienced you are. With Pandas, you can explore and analyze information and get valuable insights from your datasets. Whether you're working with financial data, healthcare records, or social media insights, Pandas gives you the tools to handle common data tasks confidently and efficiently.

Copyright ©2024 Educative, Inc. All rights reserved