Introduction to pandas

Explore how pandas enables efficient data manipulation with its Series and DataFrame structures. Understand core functions for loading, viewing, and summarizing data, and prepare datasets for visualization using Seaborn.

Why pandas?

pandas is an open-source Python library that provides efficient data manipulation and analysis tools. It offers a variety of data structures and operations, along with support for many data formats. It's built on top of Python's NumPy library.

The following key features are why pandas is a popular and commonly used library:

  • Fast, efficient performance, with support for large datasets.
  • Support for various data formats such as CSV files, JSON, XML, and SQL databases (to name a few).
  • Data cleaning and support for handling missing values.

Pandas data structures

Python’s pandas library provides support for the following two data structures:

  • pandas Series
  • pandas DataFrame

The pandas Series

The pandas Series object is a one-dimensional labeled array. We can populate a Series object with any Python data type, such as integers, strings, floats, and so on. We can think of the Series object as a column in a spreadsheet. All Series objects are indexed by default, meaning that every Series element has an index.

We can create a Series object from an array, dictionary, list, or scalar. As illustrated in the figure below, we can create the scores Series object by passing the student_score list to the pd.Series() function. Each element is indexed, and we can access any element using its index. Here, we use the conventional alias pd to refer to the pandas library.
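The figure's construction can be sketched in code as follows. The actual values in student_score aren't shown in the text, so the numbers below are illustrative:

```python
import pandas as pd

# Illustrative values; the actual student scores are not given in the text
student_score = [85, 92, 78, 64]

# Create a Series from the list; elements get default integer indexes 0, 1, 2, ...
scores = pd.Series(student_score)

print(scores)     # prints the indexed column of scores
print(scores[2])  # access an element by its index → 78
```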

The pandas DataFrame

The pandas DataFrame object is a two-dimensional data structure that organizes data in the form of rows and columns. A DataFrame can be populated using dictionaries, lists, SQL databases, CSV files, and so on. We can think of a DataFrame as a table in a spreadsheet. A DataFrame is indexed by default, and we can access each row using its index.

The creation of the DataFrame object is demonstrated in the figure below. We create the df DataFrame object by passing the data dictionary to the pd.DataFrame() function.

By default, both the Series and DataFrame objects are indexed. However, we can also specify our custom index for the elements in the index argument of the pd.Series() and pd.DataFrame() functions.
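A minimal sketch of both points follows. The contents of the data dictionary aren't shown in the text, so the columns below are hypothetical:

```python
import pandas as pd

# Hypothetical data dictionary; the actual contents of `data` are not shown
data = {'name': ['Ana', 'Ben', 'Cara'], 'score': [85, 92, 78]}

# Default integer index (0, 1, 2)
df = pd.DataFrame(data)
print(df)

# Custom index via the `index` argument (pd.Series accepts it too)
df_custom = pd.DataFrame(data, index=['r1', 'r2', 'r3'])
print(df_custom.loc['r2'])  # access a row by its custom label
```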

Hands-on data analysis with pandas

We’ll use the CustomerLaptopTransactionData.csv dataset stored in CSV format. To load the dataset, we use the read_csv() function, as illustrated in the code snippet below. We load the dataset in a pandas DataFrame, named transactions_df:

Python
import pandas as pd
transactions_df = pd.read_csv('CustomerLaptopTransactionData.csv') # load data

The first step after loading a dataset into a DataFrame is to see how the dataset looks. To view the records from the DataFrame, we use the pandas head() function, which displays the first five records.

Python
print(transactions_df.head())

Likewise, we can use the tail() function to get the last n records from the dataset. By default, it returns the last five records.

Python
print(transactions_df.tail())

We can pass any number within the range of the DataFrame's records to the head() and tail() functions. But how do we know how many records a DataFrame has? Again, pandas has us covered. We can find this using the DataFrame shape attribute, which returns a tuple of (number of rows, number of columns); this dataset has 86 rows and seven columns.

Python
print(transactions_df.shape)

Similarly, the DataFrame’s size attribute returns an integer value representing the number of elements (the number of rows times the number of columns).

Python
print(transactions_df.size)

The next step in knowing your dataset is to learn about its column data types. For this, we use the pandas DataFrame dtypes attribute.

Python
print(transactions_df.dtypes)

In the code snippet above, Purchase_ID and laptop_price are integers and the rest of the columns hold float data.

Once we’ve learned about the dataset’s size, columns, and column types, it’s good to know about the data distribution. To achieve this, we use the describe() method, which returns a statistical summary of the numerical columns in the dataset.

Python
print(transactions_df.describe())

The describe() method computes some statistical measurements for the numerical values of the DataFrame, such as percentile, mean, and standard deviation.

We’ve seen the laptop_brand column in the dataset. Let’s say we’re curious about how many different brands of laptops are sold in that particular store—we’re interested to know how many unique values exist for the laptop_brand column. We can figure this out using the nunique() method, which returns the number of unique values in a particular column. It comes in handy, particularly when we want to explore any categorical column.

Python
print(transactions_df.nunique()) # number of unique values in each column

We can see that two kinds of laptop_brand are sold. Notice that the categorical variable is represented with numerical values here, meaning one value can refer to one brand and another value to a different brand. This is just a convention; a dataset could equally store the laptop_brand values as strings.

We’ve viewed and accessed the DataFrame as a whole. However, we can also access any specific column we want to explore. For example, we can access the laptop_price column using the [] notation, as shown in the code snippet below; pandas uses square brackets to index columns.

Python
print("Single bracket\n", transactions_df['laptop_price'])
print("\n")
print("Double brackets\n", transactions_df[['laptop_price']])

In the code snippet above, line 1 uses the [] notation, and line 3 uses the [[]] notation. Do you see a difference between the two outputs? Take a second here to pause and think. Does something look familiar?

The first method, [] by default, returns a Series object, while the second method, [[]], returns a DataFrame object, even if you pass a list containing a single item. The output of the first method shows us a Series containing 86 rows, and the second method returns a DataFrame with 86 rows and one column.
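We can confirm this distinction by checking the returned types. Since the actual CSV file isn't available here, the snippet below uses a tiny stand-in DataFrame with the same column name:

```python
import pandas as pd

# Tiny stand-in DataFrame (the real CSV is not available in this sketch)
df = pd.DataFrame({'laptop_price': [500, 900], 'laptop_brand': [0, 1]})

print(type(df['laptop_price']))    # <class 'pandas.core.series.Series'>
print(type(df[['laptop_price']]))  # <class 'pandas.core.frame.DataFrame'>
```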

We can also check the number of values for each category. We have seen before that the laptop_price column had four unique values: $500, $600, $900, and $1200. But we want to know how many laptops were sold for each unique category or price. For this, we use the value_counts() method.

Python
print(transactions_df['laptop_price'].value_counts())

We can see that the laptops priced at $900 were sold the most. Do you see how analyzing just one column can give us insight?

The pandas library is often used together with the seaborn library for visualizations. We use pandas for data cleaning and modifications to make it seaborn-ready so that the visualizations convey the most accurate information.
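A minimal sketch of this workflow: use pandas to drop rows with missing values, then hand the cleaned DataFrame to seaborn. The columns and values below are hypothetical stand-ins for the real dataset, and the seaborn call is shown as a comment so the snippet runs without a plotting backend:

```python
import pandas as pd

# Stand-in for transactions_df with one missing value (real CSV not available)
transactions_df = pd.DataFrame({
    'laptop_brand': [0, 1, 0, 1],
    'laptop_price': [500, 900, None, 900],
})

# pandas cleanup: drop rows with missing values so plots convey accurate counts
clean_df = transactions_df.dropna()

# The cleaned DataFrame is now seaborn-ready, e.g.:
# import seaborn as sns
# sns.countplot(x='laptop_price', data=clean_df)
print(clean_df.shape)  # (3, 2) after removing the incomplete row
```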