How to install Pandas library in Python on Linux
Pandas is a Python library for storing and organizing data in rows and columns. With it, we can clean up messy data and plot it. It provides two main data structures: DataFrame and Series. A DataFrame stores two-dimensional data, whereas a Series stores one-dimensional data in the form of a labeled array.
Installation of Pandas
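Installing Pandas requires pip, Python's package installer. If pip is not already on your Linux system, download the get-pip.py bootstrap script and run it. The commands to install pip are given below.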
cd ~/Downloads
curl -O https://bootstrap.pypa.io/get-pip.py
python get-pip.py
pip --version
If you have already installed pip, you can skip these commands. The next step is to install the Pandas library with the following command.
pip install pandas
After a successful installation, the terminal output ends with a line beginning with "Successfully installed", followed by the names and versions of the installed packages.
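As an optional extra check, the installation can also be confirmed from Python itself by importing the library and printing its version. This is a minimal sketch; the variable name installed_version is only illustrative, and the version printed depends on what pip installed on your system.

```python
import pandas as pd

# pd.__version__ holds the version string of the installed package
installed_version = pd.__version__
print("pandas version:", installed_version)
```

If the import fails with ModuleNotFoundError, the pip used for the installation probably belongs to a different Python interpreter than the one running the script.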
You have now installed Pandas on your system. Let us discuss some functions provided by the Pandas library for data manipulation.
Functions of Pandas
Pandas provides data manipulation and analysis functions in Python, including creating Series and DataFrames, reading and writing data in various formats, exploring data, filtering, grouping, and sorting. Let us have a look at the notable functions of Pandas.
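As a minimal sketch of the reading and writing mentioned above, the example below parses CSV text into a DataFrame and writes it back out as CSV. The CSV content is hypothetical sample data, and it is held in memory with io.StringIO only so the example does not need a file on disk; in practice read_csv and to_csv are usually given file paths.

```python
import io

import pandas as pd

# Hypothetical CSV content, kept in memory so no file on disk is needed
csv_text = "Name,Age,Salary\nJohn,25,50000\nAlice,30,60000\nBob,22,45000\n"

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# With no path given, to_csv returns the CSV as a string;
# index=False leaves the row labels out of the output
csv_out = df.to_csv(index=False)
print(csv_out)
```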
Create DataFrame
The code below creates a DataFrame named df using a dictionary of lists containing Name, Age, and Salary data.
import pandas as pd

# Create a DataFrame using a dictionary of lists
data = {'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
        'Age': [25, 30, 22, 28, 35],
        'Salary': [50000, 60000, 45000, 70000, 80000]}
df = pd.DataFrame(data)

# Display the DataFrame
print("DataFrame:")
print(df)
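A Series, the one-dimensional structure mentioned in the introduction, can be created in a similar way. The sketch below is illustrative only: the variable name ages and the two-row DataFrame are assumptions made for the example. It shows that a single DataFrame column is itself a Series.

```python
import pandas as pd

# A Series is one-dimensional: a single column of values with an index
ages = pd.Series([25, 30, 22, 28, 35], name='Age')
print(ages)

# Selecting one column of a DataFrame also gives a Series
df = pd.DataFrame({'Name': ['John', 'Alice'], 'Age': [25, 30]})
print(type(df['Age']))

# ndim reports the number of dimensions of each structure
print(ages.ndim, df.ndim)
```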
Access columns and rows
We can access a column by name using df['Name'] and a row by its index label using df.loc[2]. The code below prints both.
import pandas as pd

data = {'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
        'Age': [25, 30, 22, 28, 35],
        'Salary': [50000, 60000, 45000, 70000, 80000]}
df = pd.DataFrame(data)

# Accessing columns and rows
print("\nAccessing Columns:")
print(df['Name'])
print("\nAccessing Rows:")
print(df.loc[2])
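The same idea extends to selecting several columns or a single cell. The sketch below goes slightly beyond what the article covers: iloc, which selects by integer position rather than by label, is not used elsewhere in this article and is included here only for comparison with loc. The variable names name_by_label and name_by_position are illustrative.

```python
import pandas as pd

data = {'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
        'Age': [25, 30, 22, 28, 35],
        'Salary': [50000, 60000, 45000, 70000, 80000]}
df = pd.DataFrame(data)

# Select several columns at once by passing a list of column names
print(df[['Name', 'Salary']])

# loc selects by label: the row labelled 2 and the column named 'Name'
name_by_label = df.loc[2, 'Name']

# iloc selects by integer position: third row, first column
name_by_position = df.iloc[2, 0]

print(name_by_label, name_by_position)
```

With the default index, row labels and row positions coincide, so both lookups return the same value. They can differ once the DataFrame has been filtered or sorted.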
Data summary
We can print a summary of the numerical columns in the DataFrame df using the describe() function. It displays statistics such as count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum values for each numerical column.
import pandas as pd

# Create a DataFrame using a dictionary of lists
data = {'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
        'Age': [25, 30, 22, 28, 35],
        'Salary': [50000, 60000, 45000, 70000, 80000]}
df = pd.DataFrame(data)

# Data Summary
print("\nData Summary:")
print(df.describe())
Adding columns
We can add a new column by assigning a list of values to a new column name; the list must contain one value per existing row. In this case, the newly added column Gender is assigned the values ['Male', 'Female', 'Male', 'Female', 'Male'] corresponding to the existing rows in the DataFrame.
import pandas as pd

# Create a DataFrame using a dictionary of lists
data = {'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
        'Age': [25, 30, 22, 28, 35],
        'Salary': [50000, 60000, 45000, 70000, 80000]}
df = pd.DataFrame(data)

# Adding New Columns
df['Gender'] = ['Male', 'Female', 'Male', 'Female', 'Male']
print("\nDataFrame with New Column:")
print(df)
Grouping and aggregation
The code below groups the DataFrame df by the Gender column and aggregates each group. It calculates the mean Salary and the maximum Age for each gender, storing the result in the grouped_data DataFrame.
import pandas as pd

# Create a DataFrame using a dictionary of lists
data = {'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
        'Age': [25, 30, 22, 28, 35],
        'Salary': [50000, 60000, 45000, 70000, 80000]}
df = pd.DataFrame(data)
df['Gender'] = ['Male', 'Female', 'Male', 'Female', 'Male']

# Grouping and Aggregation
grouped_data = df.groupby('Gender').agg({'Salary': 'mean', 'Age': 'max'})
print("\nGrouped and Aggregated Data:")
print(grouped_data)
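An alternative way to write the same aggregation is the keyword form of agg (named aggregation), which also lets us choose the names of the result columns. This is a sketch of an alternative style rather than the approach used above; the names summary, mean_salary, and max_age are chosen for illustration.

```python
import pandas as pd

data = {'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
        'Age': [25, 30, 22, 28, 35],
        'Salary': [50000, 60000, 45000, 70000, 80000]}
df = pd.DataFrame(data)
df['Gender'] = ['Male', 'Female', 'Male', 'Female', 'Male']

# Named aggregation: each keyword becomes an output column, built from
# a (source column, aggregation function) pair
summary = df.groupby('Gender').agg(
    mean_salary=('Salary', 'mean'),
    max_age=('Age', 'max'),
)
print(summary)
```

The result has one row per gender, with columns named mean_salary and max_age instead of Salary and Age, which avoids confusing aggregated values with the original columns.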
Filtering data
We can apply a filter to the DataFrame df based on the Age column, only keeping rows where the Age is greater than 25.
import pandas as pd

# Create a DataFrame using a dictionary of lists
data = {'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
        'Age': [25, 30, 22, 28, 35],
        'Salary': [50000, 60000, 45000, 70000, 80000]}
df = pd.DataFrame(data)

# Filtering Data
print("\nFiltered Data:")
filtered_data = df[df['Age'] > 25]
print(filtered_data)
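Filters can also be combined. The sketch below keeps rows that satisfy two conditions at once; the Salary threshold of 80000 and the variable name filtered are chosen only for illustration. Each condition needs its own parentheses, and the conditions are joined with & (and) or | (or) rather than the Python keywords and and or.

```python
import pandas as pd

data = {'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
        'Age': [25, 30, 22, 28, 35],
        'Salary': [50000, 60000, 45000, 70000, 80000]}
df = pd.DataFrame(data)

# Keep rows where Age is greater than 25 AND Salary is below 80000
filtered = df[(df['Age'] > 25) & (df['Salary'] < 80000)]
print(filtered)
```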
Sorting data
The code below sorts the DataFrame df by the Salary column in descending order and stores the sorted data in the sorted_data DataFrame.
import pandas as pd

# Create a DataFrame using a dictionary of lists
data = {'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
        'Age': [25, 30, 22, 28, 35],
        'Salary': [50000, 60000, 45000, 70000, 80000]}
df = pd.DataFrame(data)

# Sorting Data
sorted_data = df.sort_values(by='Salary', ascending=False)
print("\nSorted Data:")
print(sorted_data)
Data visualization
With the help of Pandas, we can create a bar plot from the DataFrame df with the Name column on the x-axis and the Salary column on the y-axis. Pandas relies on Matplotlib for plotting, so Matplotlib must be installed as well. The code saves the plot to ./output/Plot.png and then displays it.
import os

import matplotlib.pyplot as plt
import pandas as pd

# Create a DataFrame using a dictionary of lists
data = {'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
        'Age': [25, 30, 22, 28, 35],
        'Salary': [50000, 60000, 45000, 70000, 80000]}
df = pd.DataFrame(data)

# Data Visualization: bar plot of Salary for each Name
df.plot(x='Name', y='Salary', kind='bar', title='Salary Distribution')

# Make sure the output directory exists before saving the figure
os.makedirs("./output", exist_ok=True)
plt.savefig("./output/Plot.png")
plt.show()
Conclusion
Pandas makes working with data easy and fun regardless of how experienced you are. With Pandas, you can explore and analyze information and get valuable insights from your datasets. Whether you're working with financial data, healthcare records, or social media insights, Pandas gives you the tools to handle data challenges confidently and efficiently.